The Apache Software Foundation governs the development of some of the most popular open source projects out there, including Tomcat, Solr, and Tika. As an active contributor to open source projects, I have come across a lot of people with questions about how to contribute to open source projects in general. To help answer some of these questions I have put together a short guide on how to contribute to Apache open source projects.
This is a guide to contributing to Apache Open Source projects. It is not intended to be a how-to with specific steps for each project, as that would be impossible. Instead it attempts to provide an overview of the things you will encounter, along with a bit of history. While it is certainly shaped by my own experiences, I have tried to keep it from being one-sided: I have striven to cite many other sources, so if you disagree with something, start there.
Apache Superset was the only individual project to see over 10,000 commits in 2021 – and has attracted nearly 43,000 stars on GitHub (and 8,400 forks). The open-source data visualisation and data exploration platform is enterprise-ready and can, its advocates claim, “replace or augment proprietary business intelligence tools.”
It provides:

- a no-code interface for building charts;
- an API for programmatic customisation;
- a web-based SQL editor for advanced querying;
- a lightweight semantic layer for defining custom metrics;
- out-of-the-box support for MySQL, PostgreSQL and SQLite (“pretty much any databases that have a SQLAlchemy integration should work perfectly fine as a data source”);
- a range of visual templates, from bar charts to geospatial visualisations;
- a cloud-native architecture and highly extensible security and authentication options for administrators.
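Superset connects to a database through a SQLAlchemy connection URI of the form `dialect://user:password@host:port/database`. As a rough sketch (hostnames, credentials, and the helper function below are made-up placeholders, not part of Superset's API):

```python
# Sketch: building SQLAlchemy-style connection URIs like the ones
# Superset asks for when you add a database. All names here are
# illustrative placeholders.

def sqlalchemy_uri(dialect, user=None, password=None, host=None, port=None, database=""):
    """Build a URI of the form dialect://user:password@host:port/database."""
    auth = f"{user}:{password}@" if user else ""
    location = f"{host}:{port}" if host else ""
    return f"{dialect}://{auth}{location}/{database}"

# Networked databases need credentials and a host...
print(sqlalchemy_uri("postgresql", "scott", "tiger", "db.example.com", 5432, "sales"))
# → postgresql://scott:tiger@db.example.com:5432/sales

# ...while SQLite is just a file path (note the resulting triple slash).
print(sqlalchemy_uri("sqlite", database="path/to/analytics.db"))
# → sqlite:///path/to/analytics.db
```

Any database with a SQLAlchemy dialect follows the same pattern, which is what makes the “any databases that have a SQLAlchemy integration” claim work.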
Apache Superset is in use at a wide range of companies. According to one of its early committers, Bogdan Kyryliuk, these include Airbnb, American Express, Lyft, Nielsen, Rakuten Viki, Twitter, and Udemy.
It’s also used as the main data exploration tool at Dropbox, which consolidated 10 tools to Superset.
Step by Step Guide on How to Contribute to Open Source
When we say contributing to open source, it does not necessarily mean that you need to know how to code. There are different ways in which you can contribute even if you are a non-coder – but having some coding skills will help you (and the projects) out a lot.
Some common contributions can be through:
- Improving a project’s documentation – often its README file – to elaborate on a certain point (check this guide on how to write a good README file).
- Giving guidance on a specific project and how to use it.
- Adding sample output to show how the code works.
- Writing in-depth tutorials for the project.
- Adding translations for a project – a good place to start might be freeCodeCamp’s translation program.
- Answering questions about a project (for example on Stack Overflow or Reddit).
- Mentoring another contributor.
- Fixing typos and tidying up the project’s folder structure.
All these ways, and many more, count as contributions. Now, what exactly should you know before you start contributing to an open source project?
What can we do with the ASF JIRA?
With the ASF JIRA we can search for issues that other people have reported, report our own issues, comment on issues, and ask people for clarification or help. @mentions work here too, so if there’s someone who would be really good at solving a particular issue, you can loop them in.
But there are a few things we can’t do with the ASF JIRA that you might be used to doing with your own bug tracker at home. One of them is assigning issues to ourselves. We don’t want issues to get stuck assigned to someone who ends up getting distracted by other work, so we generally don’t assign issues to people until the issue is resolved – or sometimes during the pull request phase. JIRA also isn’t great for posting long design documents, though people certainly do. I’d encourage you to use a Google Doc or something similar, where people can comment more interactively. And while tags exist and in theory you can tag issues, in practice the tags often get removed – so I wouldn’t bother putting tags on issues.
If you haven’t seen JIRA before, this is what the ASF JIRA looks like. We can see there is an issue titled “database can’t be changed if it’s specified in URL.” To me, that sounds like not an issue, but the person who reported it thinks it’s a major bug. So I probably don’t know enough of the details – I’d just be like “Oh, I don’t know,” and I’d start reading to see what it is they wanted.
Apache Airflow saw over 5,600 commits in 2021 – and has attracted 24,000+ stars and 9,800+ forks on GitHub.
Another project born at Airbnb, Airflow is a platform to programmatically author, schedule, and monitor data workflows, and is often used to orchestrate disparate tools into a single place – something which might (for AWS users) otherwise mean manually managing schedules via AWS Data Pipeline, Lambda functions, and ECS tasks.
Much-used by data engineers (the majority of whom, for what it’s worth, are running PostgreSQL as their “meta-database”, with growing numbers using some kind of Kubernetes deployment), it lets users automate processes in Python and SQL and has a rich web UI to visualise, monitor and fix issues; it also boasts a comprehensive set of integrations with popular cloud services, from Azure Container Instances, AWS Glue, GCP Text-to-Speech and beyond. One of the best guides for new users is by data scientist Rebecca Vickery and can be read here. You can also refer to this Stack Overflow thread for more details on deployment.
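The core abstraction behind that orchestration is a DAG (directed acyclic graph) of tasks: each task declares what it depends on, and the scheduler runs tasks only after their upstream dependencies finish. The idea can be sketched in plain Python – no Airflow required – with the standard library’s topological sorter (the task names and dependency graph here are invented for illustration, not Airflow’s API):

```python
# Concept sketch of the DAG idea behind Airflow: tasks with upstream
# dependencies, executed in an order where every dependency runs first.
from graphlib import TopologicalSorter

# Map each task to the set of tasks it depends on.
dag = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "notify": {"load"},
}

run_order = list(TopologicalSorter(dag).static_order())
print(run_order)  # → ['extract', 'transform', 'load', 'notify']
```

In real Airflow you declare the same structure with operators and dependency arrows, and the scheduler adds retries, backfills, and the web UI on top.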
Choosing a component
If we decide to work in Core, it is going to take a lot of time to get our code reviewed. That’s because some pieces of Core aren’t tested as well as they should be, and everything depends on Core, so making changes here can have a very big impact – and that could be good, or that could be really bad. People are a little bit more conservative here, so it’s not the area I’d recommend when you’re starting out. For ML and MLlib: improving existing algorithms is very much welcome, but adding new algorithms hasn’t been encouraged for a long time – the reasons more or less come back to maintenance, but we can talk about that offline. Structured Streaming: lots of fun, lots of changes happening there. It’s a little bit difficult to contribute because the API is changing so rapidly, but I think it’s a really great area. SQL is also pretty cool – not just the streaming SQL part, but the base SQL layer. Lots of fun stuff, very active; I don’t do that much work there, but I know many people who do, and they enjoy it. And I’m going to try to convince you to contribute to improving the Python support, because that’s one of the things I care about a lot – and it’s one of the easiest areas to get started, as we’ll see later.
We can also contribute to the different scheduler backends: if you’re a YARN expert, please come; if you’re a Mesos expert, please come – we really need a Mesos expert contributing. Kubernetes has lots of active work and lots of reviewers; it’s a lot of fun, and you can come hang out and contribute to Spark on Kubernetes together. Fewer people are excited about Standalone mode, because increasingly Spark is deployed on top of one of these other systems.
Apache Pulsar saw over 4,600 commits in 2021 – and has attracted 10,200+ stars and 2,600+ forks on GitHub.
Apache Pulsar’s got pretty hot. We’ve written about it before here and here. A publish-subscribe (“pub-sub”) toolkit, it underpins messaging for event-driven systems and/or streaming analytics; it can also be used to decouple applications to boost performance, reliability and scalability. Similar though that may sound to Apache Kafka, there are pronounced differences, including how Pulsar separates compute and storage.
Pulsar, for example, delegates persistence to a separate system called Apache BookKeeper (a dedicated low-latency storage service designed for real-time workloads), while its “brokers” are stateless – they are not responsible for storing messages on their local disk. (Pulsar brokers run an HTTP server with a REST interface for admin and topic lookup, and a dispatcher to handle all message transfers.)
As Jaroslaw Kijanowski, a developer at SoftwareMill, neatly notes: “[Statelessness] makes spinning up or tearing down brokers much less of a hassle… The separation between brokers and the storage allows to scale these two layers independently. If you require more disk space to store more or bigger messages, you scale only BookKeeper. If you have to serve more clients, just add more Pulsar brokers. With Kafka, adding a broker means extending the storage capacity as well as serving capabilities”.
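The pub-sub model itself is easy to see in miniature. This is a concept sketch in plain Python – not Pulsar’s client API, which talks to a broker over the network – showing how a broker decouples producers from consumers: the publisher only knows the topic name, never who is listening.

```python
# Minimal in-process pub-sub sketch illustrating the messaging model
# Pulsar implements. Topic names and messages are made up.
from collections import defaultdict

class Broker:
    """Routes each published message to every subscriber of its topic."""
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        for callback in self._subscribers[topic]:
            callback(message)

broker = Broker()
received = []
broker.subscribe("page-views", received.append)          # consumer A
broker.subscribe("page-views", lambda m: None)           # consumer B, independent
broker.publish("page-views", {"url": "/home"})           # producer knows only the topic
print(received)  # → [{'url': '/home'}]
```

In Pulsar, the broker additionally hands persistence off to BookKeeper, which is what lets the two layers scale independently as described in the quote above.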
Pulsar is at the heart of Yahoo! owner Verizon Media’s own architecture, where it handles hundreds of billions of data events each day. (Yahoo! developers described it in 2018 as “an integral part of our hybrid cloud strategy [that] enables us to stream data between our public and private clouds and allows data pipelines to connect across the clouds.”) It has also been deployed at Comcast, Huawei, Splunk, and beyond.
The Apache Software Foundation is one of the largest open source software foundations in the world, and its projects are among the most widely used. The foundation takes great pleasure in providing communities of developers with the infrastructure and support necessary to enable their work. As a member of these communities, you will be leveraging that work to build your own applications.