Open Source

First Text Analysis Python Project was my first unstructured text analysis project.

My open-source contributions span several languages. The tools and languages used are listed at the end of each contribution description.

Pandas and pandas-datareader

While using Pandas in a stock screening application,
I encountered problems that appeared only for certain data values.
My first solution was a workaround. It worked, but it was rather cumbersome and ugly.
And what about existing code that might work with some data but not with others?
Modifying the Pandas code looked like a better solution.
The changes affected two projects, Pandas and pandas-datareader, because pandas-datareader is now a standalone project.
Below are the changes:
Improved missing-value handling, testing, and documentation.
Added a signature-preserving decorator for Python 2.
Made some API changes, which also required updates to the what's new notes and the documentation.
Provided a workaround for one of the issues as a temporary solution while the issue was being resolved, and participated in discussions.
Python, Cython, pytest, Sphinx, reStructuredText.
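
The signature-preserving decorator mentioned above addresses a Python 2 limitation: functools.wraps copies a function's name and docstring but not its argument list, so introspecting a decorated function reports only (*args, **kwargs). The sketch below is not the code submitted to Pandas. It illustrates one common technique, generating a shim function with the original parameter names, and is written for Python 3 so that it runs today. The names preserve_signature and log_calls are hypothetical, and only plain positional-or-keyword parameters are handled.

```python
import inspect


def preserve_signature(func, inner):
    """Return a shim that calls `inner` but exposes the argument list of `func`.

    Hypothetical helper for illustration; it supports only plain
    positional-or-keyword parameters.
    """
    params = list(inspect.signature(func).parameters.values())
    if any(p.kind is not inspect.Parameter.POSITIONAL_OR_KEYWORD for p in params):
        raise TypeError("only plain positional-or-keyword parameters are supported")
    args = ", ".join(p.name for p in params)
    # Generate "def _shim(a, b): return _inner(a, b)" using the original names.
    source = "def _shim({0}):\n    return _inner({0})\n".format(args)
    namespace = {"_inner": inner}
    exec(source, namespace)
    shim = namespace["_shim"]
    # Copy the metadata, including default values, from the original function.
    shim.__name__ = func.__name__
    shim.__doc__ = func.__doc__
    shim.__defaults__ = func.__defaults__
    return shim


def log_calls(func):
    """Example decorator: record each call while keeping the original signature."""
    def inner(*args, **kwargs):
        log_calls.seen.append(func.__name__)
        return func(*args, **kwargs)
    return preserve_signature(func, inner)


log_calls.seen = []


@log_calls
def add(a, b=2):
    """Add two numbers."""
    return a + b


# The decorated function still reports the parameters of add.
print(inspect.signature(add))
print(add(1), log_calls.seen)
```

In Python 3 this is rarely needed, because functools.wraps records the original function in __wrapped__ and inspect.signature follows it.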

Pandas commits

https://github.com/pandas-dev/pandas/commits?author=olegshteynbuk

pandas-datareader commits

https://github.com/pydata/pandas-datareader/commits?author=OlegShteynbuk

    Pandas pull requests
https://github.com/pandas-dev/pandas/pull/16589
https://github.com/pandas-dev/pandas/pull/16534

    Pandas issues
https://github.com/pandas-dev/pandas/issues/16606
https://github.com/pandas-dev/pandas/issues/16471

    pandas-datareader pull requests
https://github.com/pydata/pandas-datareader/pull/357
https://github.com/pydata/pandas-datareader/pull/364 (merged from the command line; the GitHub GUI incorrectly displays it as not merged, but the source code correctly shows it as merged:
https://github.com/pydata/pandas-datareader/commit/6cce5f18d52be802c7245c8a28d534236a9e2b24)

    pandas-datareader issues
https://github.com/pydata/pandas-datareader/issues/363
https://github.com/pydata/pandas-datareader/issues/342

              Arelle
    Arelle pull requests
https://github.com/Arelle/Arelle/pull/98

              geWorkbench
At the time I was at the Center for Computational Biology and Bioinformatics (C2B2), Columbia University.
Most of my work was on geWorkbench, a Java-based open-source desktop application for integrated genomics.
On geWorkbench I worked on new features, design, enhancements, and bug fixes.
There was some server-side work too, mostly on CaArray, a grid-based application that served as an open-source array data management system accessible both through the web and programmatically. Part of my CaArray responsibilities included installation and administration.
Some of the tasks required advanced calculus and mathematical modeling.
Java, C++/C, Swing, JBoss, Tomcat, Ant.

GitHub Gist has Python and Java code.

A Python and Pandas program that projects when hedge fund investors will receive their investments back over time.

The program reads data from an Excel file, which should have at least two worksheets: one for the funds' liquidity terms and the other for the tranche investments. Three scripts produce reports and graphs. The program focuses on the most common hedge fund withdrawal restrictions. A more detailed description is in the HedgeFundsRedemption.md file.
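
To give a feel for the kind of calculation involved, here is a minimal sketch for a single tranche. It is not taken from the program: the function names, the choice of a lock-up period plus a notice period, and the assumption that redemptions happen only at calendar quarter ends are all hypothetical simplifications, and it uses only the standard library rather than Pandas.

```python
import calendar
from datetime import date, timedelta


def add_months(d, months):
    """Shift a date by whole months, clamping the day to the end of the month."""
    index = d.year * 12 + (d.month - 1) + months
    year, month0 = divmod(index, 12)
    month = month0 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


def quarter_end_on_or_after(d):
    """Return the first calendar quarter end that is not before d."""
    month = ((d.month - 1) // 3 + 1) * 3  # 3, 6, 9 or 12
    return date(d.year, month, calendar.monthrange(d.year, month)[1])


def earliest_redemption(invested, lockup_months, notice_days, today):
    """Earliest redemption date for one tranche under the assumed terms.

    The tranche cannot be redeemed before the lock-up ends, and notice
    given today takes effect no sooner than notice_days from today.
    """
    lockup_end = add_months(invested, lockup_months)
    notice_effective = today + timedelta(days=notice_days)
    return quarter_end_on_or_after(max(lockup_end, notice_effective))


# A tranche still inside its lock-up, and one that is only waiting on notice.
today = date(2020, 6, 1)
print(earliest_redemption(date(2020, 1, 15), 12, 90, today))
print(earliest_redemption(date(2019, 1, 15), 12, 90, today))
```

Real liquidity terms are more varied than this, so the sketch should be read only as an outline of the date arithmetic.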

NYSE trading days for backtesting

This is a fork of jckantor's Python dateutil rule sets for NYSE trading days and holiday observances. The original rules are valid only from the present onward. Backtesting and pattern recognition sometimes need NYSE trading days for the past several years, so the rules were modified to produce NYSE trading days and holiday observances from 1986 onward.
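
The fork expresses the calendar as dateutil rule sets. The sketch below deliberately swaps in plain datetime arithmetic from the standard library to show the underlying idea: shift weekend holidays to an observed weekday, then keep the weekdays that are not closed. The function names, the caller-supplied holiday list, and the simplified observance rule are assumptions. Real exchange calendars have exceptions and one-off closures that this ignores.

```python
from datetime import date, timedelta


def observed(holiday):
    """Shift a holiday off the weekend: Saturday to Friday, Sunday to Monday.

    Simplified rule for illustration only.
    """
    if holiday.weekday() == 5:  # Saturday
        return holiday - timedelta(days=1)
    if holiday.weekday() == 6:  # Sunday
        return holiday + timedelta(days=1)
    return holiday


def trading_days(start, end, holidays):
    """List the weekdays from start to end (inclusive) that are not observed holidays."""
    closed = {observed(h) for h in holidays}
    days = []
    day = start
    while day <= end:
        if day.weekday() < 5 and day not in closed:
            days.append(day)
        day += timedelta(days=1)
    return days


# July 4, 2004 fell on a Sunday, so the sketch treats Monday, July 5 as closed.
for d in trading_days(date(2004, 7, 1), date(2004, 7, 9), [date(2004, 7, 4)]):
    print(d.isoformat())
```

For backtesting, the holiday list would have to cover every year in the historical range, which is what extending the rules back to 1986 provides.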

Java program to perform several dataset operations

The dataset is in a CSV (comma-separated values) file. The program reads the dataset into a dataframe and performs operations on it. The program can be used as a library or from the command line, where the user can test the code using a simple language to define the operations. The purpose of this exercise is to demonstrate that, in the absence of a Pandas-like library for Java, you are better off using Python and Pandas for any advanced data processing job, even if that means learning a new language. However, for a Java programmer who doesn't know Python and needs relatively simple dataset column operations, something like this might make sense. The README file goes into more detail about the implementation and its limitations, even for this tiny subset of Pandas functionality.

Other contributions

Leo outlining editor

Submitted a bug fix for the integration with aspell, a C++ spell-checking library. Leo switched from aspell to PyEnchant several years ago.

Python, C++.

Submitted code patches to Apache Axis to improve AxisFault logging and provided a workaround.

Java.

Filed bug reports against Java Swing, JBoss, and several other Java tools and libraries.


Ant Script to Update/Install Eclipse

The script, which was written several years ago and hosted on this site, is now on GitHub.

When the script was written, Eclipse was relatively new and, as a new platform, lacked some important features that were left to plugin writers. Many Eclipse users waited anxiously for each new Eclipse build or new version of a popular Eclipse plugin.

However, once a new version arrived, updating was not straightforward: there were too many rules to follow, and the Eclipse documentation was not as good as it is now. Plugins were usually downloaded and installed manually, and plugin writers used different conventions to package their plugins. Occasionally a new version was incompatible with the old one in a way that could make your workspace unusable after an update. The script addressed all these issues and was easily customizable; sometimes you just need a command-line tool.

It is difficult to say for sure how widely this script was used. At the time Apache Ant was itself relatively new and was used as a build tool, a replacement for the UNIX make utility. Some indication of the script's usage comes from the emails users sent with questions, feature requests, suggestions, and code contributions. The script was reviewed in several blogs and newsgroups, and a Google search for "Ant script" at the time returned this script as the top result.

The following text was written several years ago, at the same time as the script. Even if some statements look slightly dated, I have decided to leave the original text intact to preserve the spirit of the time:

Eclipse is a great IDE. It is relatively new and the speed with which Eclipse team introduces new features is amazing. However, my first update to a new Eclipse build was rather time consuming; and from reading news I realized that other developers were struggling with updates too. The script simplified updates and, over time, evolved to incorporate some other related activities. All the documentation is in the Readme file. One chapter in the Readme file, named What is the Right Way to Update?, is different from the other chapters as this chapter is not about how to use the script but rather why and when to use it.

You can download a zip file from this site or just browse Readme file online. Latest changes are in Release Notes. It is worth noting that Eclipse comes with the Update Manager but it seems to be for major releases and not for builds. The primary audience for this script is developers who update Eclipse quite often.

Read what others think:

http://youarenumber6.blogspot.com/2004/08/departmental-eclipse-with-ant.html

http://www.jroller.com/page/dorodok/20030106
