This spring the NumFOCUS Board of Directors awarded targeted small development grants to applicants from or approved by our sponsored and affiliated projects. In the wake of a successful 2016 end-of-year fundraising drive, NumFOCUS wanted to direct the donated funds to our projects in a way that would have impact and visibility to donors and the wider community. Each grant will help the recipient project to produce a clear outcome, achievable within 2017.
Just over $13,000 was awarded in grants to support the following projects:
Widening platform availability for MDAnalysis: Full Python 3 Support
This project aims to include full Python 3 support for MDAnalysis; at the moment, only Python 2.7 is fully supported. Although about 80% of the code passes unit tests in Python 3, we urgently need to close the remaining 20% gap in order to support our user base and to safeguard the long-term viability of the project. MDAnalysis started almost 10 years ago, when Python was around version 2.4 and interfacing with existing C code was mostly done by writing C wrappers that directly used the CPython API. This legacy code has hampered a speedy full transition to Python 3, and consequently MDAnalysis lags behind the rest of the scientific Python community in fully supporting Python 3.
The grant will support a final focused drive to complete support of Python 3 while also remaining compatible with Python 2.7 for as long as it is officially supported (2020). To do this, two core developers (Richard Gowers (RG) & Tyler Reddy (TR)) will visit Arizona State University to work on the issue full-time for 2 weeks. The output of this project will be merged into the development branch and will be included in the existing Travis CI build matrix. This will then be one of the key features to be included in the upcoming 0.17 release which is targeted for September 2017 (to coincide with the inclusion of the anticipated outputs from Google Summer of Code 2017 projects). For MDAnalysis it is vital to fully support Python 3 in order to maintain and grow its user and developer base. The work supported by the grant will put MDAnalysis on track with the rest of the scientific Python community, increase package interoperability, and promote the overall move towards Python 3.
h5py backend for PyTables
The goal is to define a new way to access I/O that would allow a new version of PyTables (probably v4.x) to use different backends. As h5py is a great interface for HDF5, the main priority is interfacing with h5py so as to allow HDF5 access through it. This way PyTables can leverage h5py to access the most advanced features of HDF5 while still delivering features like advanced table management, fast table queries and easy access to advanced Blosc meta-compressors (and with it, to a wide array of codecs, like LZ4, Snappy and Zstandard). You can see a more detailed blog about our vision here. In fact, work has already started on that front: in August 2016 a handful of PyTables core developers gathered to start this precise task, and although they made a lot of progress on the Table object (the fundamental one in PyTables), there is still quite a bit of work to do. This grant will allow PyTables to continue the work done so far and publish an alpha release with the basic Table, CArray, EArray and VLArray objects working, plus hopefully gain some traction towards promptly releasing a stable version that unifies the best of the PyTables and h5py packages. The grant work is meant to address project 1) here.
With this approach, PyTables and h5py will be close to complementary instead of having overlapping functionality. This overlap leads to redundant effort for both core developers and community users of PyTables and h5py; moreover, there are two places where bugs could be reported, two places where nasty unicode issues could come up, two handles to your HDF5 files in memory, and so on. The grant will allow a more uniform API for HDF5 files.
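As a rough illustration of the pluggable-backend idea described above, the following Python sketch separates a table-like object from the storage layer it writes to. Every name here (`StorageBackend`, `InMemoryBackend`, `SimpleTable`) is hypothetical and invented for this example; none of it is the PyTables or h5py API, and the real design may look quite different.

```python
from abc import ABC, abstractmethod


class StorageBackend(ABC):
    """Hypothetical minimal interface a storage backend would implement."""

    @abstractmethod
    def write(self, path, rows):
        """Store a list of rows under the given path."""

    @abstractmethod
    def read(self, path):
        """Return the list of rows stored under the given path."""


class InMemoryBackend(StorageBackend):
    """Toy backend that keeps data in a dict; stands in for a real HDF5 layer."""

    def __init__(self):
        self._store = {}

    def write(self, path, rows):
        self._store[path] = list(rows)

    def read(self, path):
        return list(self._store.get(path, []))


class SimpleTable:
    """Hypothetical table object that delegates all I/O to its backend."""

    def __init__(self, backend, path):
        self.backend = backend
        self.path = path

    def append(self, row):
        rows = self.backend.read(self.path)
        rows.append(row)
        self.backend.write(self.path, rows)

    def where(self, predicate):
        # Query logic lives in the table layer, independent of the backend.
        return [row for row in self.backend.read(self.path) if predicate(row)]


table = SimpleTable(InMemoryBackend(), "/measurements")
table.append({"id": 1, "value": 0.5})
table.append({"id": 2, "value": 2.5})
large = table.where(lambda row: row["value"] > 1.0)
```

In a design along these lines, swapping the toy backend for one built on another library would not require changes to the table and query code, which is the kind of separation the grant work is aiming for.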
Text Analytics Introductory Course for Social Scientists
Text mining and machine learning are not taught to social scientists at Slovenian universities, and few students and professors in this area know about their potential for research. The workshop will be focused on teaching the participants the core data mining methods and how to combine them with text analytics. The entire workshop will be hands-on — we will use our own tool, Orange, that offers components for text mining, visualization and deep learning-based embedding within an easy-to-use visual programming environment. Sections of Orange were specifically designed for teaching, and while they have been tested in workshops for engineers and biomedical researchers, this will be the first time we will prepare the course for social scientists.
At the workshop, participants will actively construct analytical workflows and go through case studies with the help of the instructors. They will learn how to manage textual data, preprocess it, use machine learning, data projection and visualization techniques to expose hidden patterns and evaluate the resulting models. At the end of the workshop, the participants will know how to use visual programming to seamlessly construct data analysis workflows with textual data.
The workshop will extend our existing hands-on course materials to cover digital humanities, and two case studies prepared for the course will be made available on Orange’s YouTube channel.
The goal of this project is to support additions to the `numexpr` module. NumExpr is a core module within the PyData ecosystem. It compiles Python code passed as strings into a program which is then run through a virtual machine written in C. The virtual machine efficiently blocks and threads NumPy-like array calculations on modern, multi-core processors. Due to limitations of the original NumExpr module, starting in late 2016, we began a re-write of NumExpr which is currently under development as the NumExpr-3.0 (NE3) branch.
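To give a feel for the blocked evaluation described above, here is a small sketch in plain Python and NumPy. It is not how NumExpr is implemented (NumExpr has its own compiler and a virtual machine written in C, and it also threads the work); the function name `evaluate_blocked` and the `block_size` parameter are assumptions made purely for illustration. The sketch compiles an expression string once and then applies it to one slice of the inputs at a time, so that temporaries stay small.

```python
import numpy as np


def evaluate_blocked(expression, arrays, block_size=4096):
    """Evaluate an expression string over equal-length 1-D arrays, block by block.

    Illustrative only: `arrays` must contain at least one array, and the
    expression is run with Python's eval, so it must come from a trusted source.
    """
    code = compile(expression, "<expression>", "eval")  # compile the string once
    n = len(next(iter(arrays.values())))
    out = None
    for start in range(0, n, block_size):
        stop = min(start + block_size, n)
        # Slice every input to the current block (NumPy slices are views).
        block = {name: arr[start:stop] for name, arr in arrays.items()}
        result = np.asarray(eval(code, {"__builtins__": {}}, block))
        if out is None:
            # Allocate the full output once the result dtype is known.
            out = np.empty(n, dtype=result.dtype)
        out[start:stop] = result
    return out if out is not None else np.empty(0)


a = np.arange(10.0)
b = np.ones(10)
blocked = evaluate_blocked("2 * a + b", {"a": a, "b": b}, block_size=3)
```

The result matches what NumPy would compute for `2 * a + b` in one pass; the difference is that intermediate results are only ever one block long.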
At present the version 3.0 development branch of NumExpr (NE3) is in an alpha state and is not ready for production use. In spite of that, several individuals have already tried to use the alpha, due to the large number of improvements offered. R.A. McLeod proposes the following pushes be undertaken to move NE3 into a state suitable for public use:
- Analysis of the NumExpr program to determine the broadcasted size of the output array. This will be implemented within the C-module similar to `numpy.broadcast` but without the generation of Python objects, for speed reasons.
- Fixing of bugs found by the automated test submodule, and working continuous integration on AppVeyor (Windows) and Travis CI (Linux and OS X).
- Documentation generated through Sphinx and pushed automatically to ReadTheDocs.org — in particular, a tutorial on how to add custom operations/functions to the code generator will be provided.
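The first item in the list above, working out the broadcasted output size, can be sketched in pure Python as follows. This is only an illustration of the standard NumPy broadcasting rule applied to shape tuples; the function name `broadcast_shape` is an assumption, and the actual NE3 implementation is planned to live in the C module.

```python
def broadcast_shape(*shapes):
    """Return the shape that results from broadcasting the given shapes together.

    Follows the NumPy rule: shapes are aligned on their trailing axes, and two
    sizes are compatible when they are equal or when one of them is 1.
    """
    ndim = max((len(shape) for shape in shapes), default=0)
    result = [1] * ndim
    for shape in shapes:
        offset = ndim - len(shape)  # right-align shorter shapes
        for axis, size in enumerate(shape):
            i = offset + axis
            if size == 1:
                continue  # size-1 axes stretch to match anything
            if result[i] == 1:
                result[i] = size
            elif result[i] != size:
                raise ValueError(
                    "shapes {} cannot be broadcast together".format(shapes)
                )
    return tuple(result)


column_times_row = broadcast_shape((3, 1), (1, 4))
vector_with_cube = broadcast_shape((5,), (2, 1, 5))
```

Because it works only on tuples of integers, a routine like this never has to create array objects, which is the property the proposal is after.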
The goal is to release the NE3 branch as `numexpr3` via PyPI and conda. Eventual replacement of NumExpr 2.6 with NE3 will depend on re-implementing support for `bytes` and `unicode` strings (~1-2 weeks of work) and reduction operations (e.g. min, max, sum) (~3-5 weeks of work), to be undertaken as circumstances permit.
NumExpr is a requirement of NumFOCUS-supported modules Pandas and PyTables, where it is used for evaluations and queries for large arrays. NE3 will extend support for a greater proportion of NumPy to these operations. NE3 is significantly faster than NE2.6 for most use cases. Support for extended syntax such as assignment and multi-line expressions will permit more complicated algorithms to be implemented within the context of Pandas and PyTables evaluate function calls. In general, there are not many solutions for parallel processing in Python that use threads instead of processes. Multi-processing is simpler and more flexible, but suffers from increased memory consumption and setup time. NE3 is particularly fast regarding setup time and can potentially be used in hybrid multi-processing/threading schemes in the future. It has only NumPy as a dependency, which makes it highly portable.
SymPy 1.1 Release Support
The SymPy project had its most recent stable release, 1.0, in March 2016. Since then, almost 500 pull requests have been merged into the development branch, containing numerous bug fixes and improvements, including work from six Google Summer of Code projects. A new release of the library is overdue, but unfortunately considerable work is required to do such a release and the current project leadership does not have dedicated time and effort to perform a release. The small development grant will allow full-time work to be done for a release of SymPy. The release process has several components, including:
- Writing release notes for the aforementioned approximately 500 changes since version 1.0.
- Making sure any blocking issues are fixed before the release is made. A blocking issue is an issue that must be fixed before the release can be made, for instance, because of a regression since the previous release, or to prevent an API break. These blocking issues generally require development work to fix, and are often the most time consuming part of the release process.
- Updating the automated script that does the actual release. The script SymPy has used in the past requires maintenance to be usable for this release.
- Making a release candidate, and fixing any major issues that are found in it by the community.
- Making the final release, including uploading the tarball to PyPI, and updating various websites. This final step is mostly automated, and is the easiest part of the process. The most time-consuming part is the work that must be done before the release to make sure that it is stable and doesn’t break things for our users.
Without funding, the SymPy release can only be done in volunteer time, meaning the actual release date would likely be pushed back even further. The community will benefit the most, as the typical SymPy user relies on the stable released version from pip or conda. Many important improvements and bug fixes have been made since version 1.0, and many users are forced to work against the development branch of SymPy. While we do keep our development branch usable, we would prefer that most end users use the stable release, as it is easier to install, and the development branch can be subject to small regressions and API changes as development happens.
American Meteorological Society Short Course on Open Source Radar Software
This course will take place on August 27th, 2017, the day before the 2017 AMS Radar Conference starts. We will first introduce the participants to the overall process of Open Source Software development. This will include an introduction to the Git utility, the GitHub platform, and good practices such as forking, pull request submission and issue reporting. In the second part, we will present several Open Source Radar software packages; in particular, we will present in detail Py-ART, ARTView, BALTRAD and wradlib, which have been partially developed by the project team. This will include a presentation of their functionality as well as their development history and lessons learned. We will close the day with a discussion and feedback.
The organization costs of this course will be covered by the registration fee, all of which will be managed by the AMS. The team listed here will work voluntarily. In order to make the course more accessible, we will use the NumFOCUS grant to reduce the registration fee. This is justifiable since our focus is not to present tutorials for the use of the presented software, but rather to inform the members of the radar community about the Open Source development process through our example: the methods used, the lessons learned and the results achieved.
At the same time, while none of the software packages presented is directly affiliated with NumFOCUS, they are all based on affiliated and supported projects. Py-ART is written in Python and makes extensive use of NumPy and SciPy, and falls under the scope of the latter, as it is essentially a library to access, modify and analyze scientific data. It also makes use of Cython and is distributed with conda. ARTView is a visualization tool that builds on Py-ART. While based mainly on PyQt for the user interface, it uses Matplotlib for plotting and SciPy for some analyses. BALTRAD makes heavy use of Python and NumPy, while wradlib is based on Python and SciPy.