Tea

Tea has always been a big part of my life.

As is probably the case for many in the Western world, I started with black tea, which in China is known as red tea. After trying nearly all types of tea, I settled on pu-erh, which in China is often referred to as black tea.

Pu-erh is very earthy and warm. Its aroma and taste can vary significantly depending on the area it comes from (even the mountain slope matters), as well as processing, storage, year, factory, and many other factors - pretty much like wine.

Another great thing about pu-erh is that you can age it, and the tea gets better with time, like scotch. Isn't it great? ... Most of the time it is, but sometimes you get a miss: a tea that tasted better when young.

I drink other types of tea too, but not as often as pu-erh. If for some reason I run out of pu-erh (I don't think that will ever happen, as I have enough pu-erh not to worry about buying it anymore), next in line will be oolong or green, then black and white.

Most of my tea comes from China, and I usually make tea using a slightly modified version of Gongfu cha. Later I will post pictures of my tea collection; currently, I am in the process of retasting all my tea and updating the corresponding notes.

And in case you haven't noticed it yet, this site's bootstrap theme is inspired by tea.


Web Development


In the early days of web development, the server generated web pages and the browser just displayed them. Now the task of generating web pages is split between the server and the client (browser), and there is a lot of discussion on the web about what should be done on the server and what on the client; a big part of it is the communication between the two. My experience has been more on the server side, but I have done a fair share of client-side development too.

Web scraping, which I have done a fair amount of, is not exactly web development; it is more like analyzing a website and extracting data from it. But the technical skills overlap with web development enough to justify putting it here instead of creating a separate blog entry. And the more complicated and dynamic the client side of a site is, the more difficult the scraping is.

Below are some of the libraries and tools that I use:

Django, Flask, Redis, RQ, Celery, RabbitMQ, Docker, Heroku, Netlify, Pelican, HTML, CSS, JavaScript, jQuery, AJAX, Bootstrap, reStructuredText, REST, Beautiful Soup, Jsoup, Selenium, XML, JSON, YAML, JBoss, Tomcat, Glassfish.


Python


Several years ago, when I was doing mostly Java development, we needed a data manipulation and analysis tool, and the Python library Pandas came out as a clear winner. Pandas is a world in itself, and learning Pandas arguably requires more effort than learning Python. I have used Pandas rather extensively and made several contributions to Pandas (v0.21.0) and pandas-datareader (v0.5.0); more details about these contributions are here.

I had used Python occasionally before, but my knowledge of it was rather superficial, which didn't stop me many years ago from contributing to an open-source project written in Python. My contribution was about interoperability with a C++ library, so it might have been more C++ than Python.

I used various Python modules, packages, and tools including multithreading and concurrency. My first Python multi-threaded program was actually running slower than the single-threaded one.

Coming from Java, it was not what I expected. As it turned out, the Python interpreter (I use CPython) uses a Global Interpreter Lock (GIL), because the interpreter is not thread-safe. As a result, only one thread executes Python bytecode at a time, and you need an I/O-bound program to see the benefits of multithreading in Python. There are debates on the web about the GIL with many proposed solutions, each with its pros and cons.

In our case, the solution was to use multiprocessing, which scales well for distributed processing. Another solution would be to do computationally intensive tasks in C/C++ and use Cython. As some of our tasks turned out to be I/O bound, we added a task execution section to the system configuration. The user can configure the task execution engine, such as the number of processes and/or threads, and run experiments to find the optimal numbers; by default, there are no threads, only processes. A custom solution using modules from the Python standard library was developed first, and a Redis and RQ based solution was added later. We are also considering a distributed framework like Spark or Dask.
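A configurable task execution engine of this kind could be sketched roughly as follows. This is only a minimal illustration built on `concurrent.futures` from the standard library, not the actual system: the `TASK_EXECUTION` keys and the names `make_executor`, `run_tasks`, and `sum_of_squares` are hypothetical and chosen just for the example.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

# Hypothetical configuration section; the key names are assumptions.
# "process" suits CPU-bound tasks (sidesteps the GIL),
# "thread" suits I/O-bound tasks.
TASK_EXECUTION = {"engine": "process", "workers": 2}


def make_executor(config):
    """Build an executor from the configuration; processes are the default."""
    workers = config.get("workers", 1)
    if config.get("engine", "process") == "thread":
        return ThreadPoolExecutor(max_workers=workers)
    return ProcessPoolExecutor(max_workers=workers)


def sum_of_squares(n):
    """A stand-in for a CPU-bound task."""
    return sum(i * i for i in range(n))


def run_tasks(config, inputs):
    """Run the sample task over all inputs with the configured engine."""
    with make_executor(config) as executor:
        return list(executor.map(sum_of_squares, inputs))


if __name__ == "__main__":
    # The guard matters for process pools on platforms that spawn workers.
    print(run_tasks(TASK_EXECUTION, [10, 100, 1000]))
```

Switching `"engine"` between `"process"` and `"thread"` and varying `"workers"` is one simple way to run the kind of experiments mentioned above without touching the task code.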

We also use Python as a portable scripting language, as our scripts run on Linux and Windows.

Moving from Python 2 to Python 3 was a rather big project. The first step was to make the source code compatible with both Python 2 and 3; this eliminated any need to maintain two different code bases during the transition. It was also a good opportunity to go over the code and do some cleaning and refactoring, especially for the code that was written when we were new to Python. The transition was helped by the fact that we had been preparing for it for some time, and all new code was written with Python 3 in mind whenever possible.
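To give an idea of what code compatible with both Python 2 and 3 tends to look like, here is a small sketch. It is an illustration only, not taken from our code base; the helper names `PY2`, `text_type`, `iteritems`, and `describe` are hypothetical, and real projects often rely on a compatibility library instead of hand-written helpers.

```python
# Make Python 2 behave like Python 3 for imports, division, print, and strings.
from __future__ import absolute_import, division, print_function, unicode_literals

import sys

PY2 = sys.version_info[0] == 2

if PY2:
    text_type = unicode  # noqa: F821 - this name exists only on Python 2

    def iteritems(d):
        """Iterate over dictionary items without building a list."""
        return d.iteritems()
else:
    text_type = str

    def iteritems(d):
        """Iterate over dictionary items without building a list."""
        return iter(d.items())


def describe(prices):
    """Format a mapping of names to numbers the same way on both versions."""
    return ", ".join(
        "{0}={1:.2f}".format(name, value)
        for name, value in sorted(iteritems(prices))
    )


if __name__ == "__main__":
    # With "division" imported, 1 / 2 is 0.5 on Python 2 as well.
    print(describe({"b": 2, "a": 1 / 2}))
```

Once Python 2 support is dropped, the `PY2` branch and the helper functions become dead weight, which is exactly the kind of extra code that has to be cleaned up after such a transition.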

A more detailed description of our transition experience deserves its own entry. Here I just want to mention one side effect of the transition: code readability. After almost any code conversion you will probably end up with less readable code, especially when using conversion tools, and code readability is one of the main reasons Python became so popular. A big part of the conversion was the intermediate step of making the code Python 2 and Python 3 compatible. This step introduced some extra code that is not needed anymore. We are Python 3 only from now on and are very happy with the results of the transition; we just need to find some extra time for the code cleanup.

Later I will describe some of the cool features of Python with examples from my projects, probably in a separate entry. A big part of my Python development is data processing and web development, along with various scripts. Some of my data processing projects might require a separate entry, for example, extracting financial data from EDGAR regulatory filings (XBRL format, unstructured text).

First Text Analysis Python Project was my first unstructured text analysis project.

Some samples of Python code are on GitHub Gist:

One program reads data from an Excel file, which should have at least two worksheets: one for the funds' liquidity terms and the other for the tranche investments. There are three scripts to produce reports and graphs. The focus of the program is on the most common hedge fund withdrawal restrictions. A more detailed description is in the HedgeFundsRedemption.md file.

Another is a fork from jckantor of Python dateutil rule sets for NYSE trading days and holiday observances. The original rules are valid from the present onward. For backtesting or pattern recognition, however, you sometimes need NYSE trading days for the past several years. The rules were modified to produce NYSE trading days and holiday observances from 1986 onward.

This website was created using Pelican, which is a Python tool.

Below are some of the Python libraries and tools that I use:

Pandas, NumPy, Beautiful Soup, Requests, Selenium, Cython, SciPy, NLTK, Matplotlib, pytest, unittest, Django, Flask, Redis, RQ, Pelican, reStructuredText, Sphinx.


DB - SQL, NoSQL

I have used different databases, both SQL (relational) and NoSQL (non-relational), depending on the project.

In Python, I would often use Pandas to load data from a database into a dataframe, process the data, and save the results.

In Java, I have used either an API like JDBC or an ORM like Hibernate.

In C++, I have used the c-tree Plus database library, which allows you to design your own index structures and have a database of SQL or NoSQL flavor. Below is a description of how c-tree Plus was used in one of my projects.

All db-related activities, including data and code, were my responsibility. We used the c-tree Plus library as a db engine embedded in the application process address space. All db calls were encapsulated in a library used by all members of the development team. The library was, in essence, an application-specific Object-to-Datastore Mapping tool. Later a SQL flavor was added to the library and all calls to c-tree Plus were converted to Sybase; in this case the library behaved like an application-specific ORM (object-relational mapping) tool. The transition of applications from c-tree Plus to Sybase was very smooth, and developers had a choice between an embedded db and a Sybase server.
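The idea of hiding the db engine behind one application-specific library can be sketched as follows. The original library was written in C++ against c-tree Plus and Sybase; this Python sketch only illustrates the shape of the design, with an in-memory dictionary standing in for the embedded engine and SQLite standing in for the SQL server. All names here (`RecordStore`, `EmbeddedStore`, `SqlStore`, `open_store`) are hypothetical.

```python
import sqlite3
from abc import ABC, abstractmethod


class RecordStore(ABC):
    """The only interface that application code is allowed to see."""

    @abstractmethod
    def save(self, record_id, name):
        """Insert or update a record."""

    @abstractmethod
    def load(self, record_id):
        """Return the stored name, or None if the record is missing."""


class EmbeddedStore(RecordStore):
    """Stand-in for an engine embedded in the application process."""

    def __init__(self):
        self._records = {}

    def save(self, record_id, name):
        self._records[record_id] = name

    def load(self, record_id):
        return self._records.get(record_id)


class SqlStore(RecordStore):
    """Stand-in for a SQL server backend (SQLite used for illustration)."""

    def __init__(self, connection):
        self._conn = connection
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS records "
            "(record_id TEXT PRIMARY KEY, name TEXT)"
        )

    def save(self, record_id, name):
        self._conn.execute(
            "INSERT OR REPLACE INTO records (record_id, name) VALUES (?, ?)",
            (record_id, name),
        )
        self._conn.commit()

    def load(self, record_id):
        row = self._conn.execute(
            "SELECT name FROM records WHERE record_id = ?", (record_id,)
        ).fetchone()
        return row[0] if row else None


def open_store(backend):
    """Pick a backend by name; callers never touch the engine directly."""
    if backend == "sql":
        return SqlStore(sqlite3.connect(":memory:"))
    return EmbeddedStore()
```

Because application code depends only on the `RecordStore` interface in this sketch, swapping the backend is a one-line change in `open_store`, which is the property that makes this kind of migration smooth.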

SQL

Often the projects that I was on required different levels of SQL handling.

For one project I did a conversion from Oracle and PL/SQL to Python, Pandas, and MySQL. Pandas was used, but a lot of the work was done in SQL scripts.

For another project that was PostgreSQL based, I wrote a database schema conversion script and developed a synchronization module between an application cache and a database. Triggers were developed in PL/pgSQL, C, and Python. It's a very nice feature of PostgreSQL that different languages can be used for stored procedures; in my case, the Python code that replaced the code originally written in PL/pgSQL was simpler.

NoSQL

My experience with NoSQL databases includes OO db, Graph db, and in-memory key–value database.

OO db was used for a Quartz based project.

Graph db. On one of the projects I was assigned a ticket to fix a time-critical task. It turned out that the task was spending nearly an hour on one db query, while other queries took seconds; the database was Neo4j. A workaround for the Neo4j issue was found, and the query time was reduced from about an hour to several seconds. Neo4j has very good docs.

Below are some of the databases that I was using:

PostgreSQL, Oracle, MySQL, SQLite, Sandra, Neo4j, SYBASE, SQL Server, Redis


Development Environment


Currently, most of my development is done on Linux and/or Windows.

Version control is Git. In the past I have used Mercurial, Subversion, CVS.

Most of my scripting these days is in Python and sometimes in Ant, mainly because Python and Ant scripts can be executed on both Linux and Windows. I have also used Perl and Groovy. Portability has its limitations, though, and sometimes you just need the system shell, so I use it when I have to. On several projects I was responsible for setting up and managing the development environment.

Most of the time I use PyCharm, sometimes Eclipse, PyDev, Visual Studio Code, or any available text editor, depending on the environment.



Open Source

GitHub

First Text Analysis Python Project was my first unstructured text analysis project.

GitHub Gist


My open-source contributions were in different languages. Tools and languages that were used are listed at the end of each contribution description.


              Pandas and pandas-datareader
While using Pandas in a stock screening application, I encountered problems that occurred only for certain data values. The first solution was a workaround. It worked, but it was rather cumbersome and ugly. And what about existing code that might work with some data but not with other data? Modifying the Pandas code looked like a better solution. The changes actually affected two projects, Pandas and pandas-datareader, as pandas-datareader is now a standalone project.
Below are the changes:
  • Made improvements to missing values handling, testing, and documentation.
  • Added a signature-preserving decorator for Python 2.
  • Some of the changes were API changes, which also required updates to the what's new notes and the documentation.
  • Provided a workaround for one of the issues as a temporary solution while the issue was being resolved, and participated in discussions.
Python, Cython, pytest, Sphinx, reStructuredText.
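For readers unfamiliar with the term, the sketch below shows what "signature-preserving" means. On Python 2, `functools.wraps` copies the name and docstring of the wrapped function but not its call signature, which is why a dedicated decorator was needed there. This sketch is a Python 3 illustration only, not the code from the pull request; `count_calls` and `fill_missing` are hypothetical names.

```python
import functools
import inspect


def count_calls(func):
    """Wrap func, counting calls while keeping its name, docs, and signature."""

    @functools.wraps(func)  # on Python 3 this also sets wrapper.__wrapped__
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)

    wrapper.calls = 0
    return wrapper


@count_calls
def fill_missing(values, default=0):
    """Replace None entries with a default value."""
    return [default if value is None else value for value in values]


if __name__ == "__main__":
    # inspect.signature follows __wrapped__, so it reports the signature
    # of the original function rather than (*args, **kwargs).
    print(inspect.signature(fill_missing))
    print(fill_missing([1, None, 3]))
```

A preserved signature matters for documentation tools and interactive help, which would otherwise show only `(*args, **kwargs)` for every decorated function.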



    pandas-datareader pull requests:
https://github.com/pydata/pandas-datareader/pull/364 (merged from the command line, so the GitHub GUI incorrectly displays it as not merged; the merge is visible in the source code: https://github.com/pydata/pandas-datareader/commit/6cce5f18d52be802c7245c8a28d534236a9e2b24 )


              Arelle
    Arelle pull requests


              geWorkbench
At the time I was at The Center for Computational Biology and Bioinformatics (C2B2), Columbia University.
Most of my work was on geWorkbench, a Java-based open-source desktop application for integrated genomics. While on geWorkbench I was involved in new features, design, enhancements, and bug fixes.
There was some server side work too, mostly on a grid based application CaArray, which was an open-source web and programmatically accessible array data management system. Part of my CaArray responsibilities included installation and administration.
Some of the tasks required advanced calculus and mathematical modeling.
Java, C++/C, Swing, JBoss, Tomcat, Ant.


                GitHub Gist has Python and Java code.
One program reads data from an Excel file, which should have at least two worksheets: one for the funds' liquidity terms and the other for the tranche investments. There are three scripts to produce reports and graphs. The focus of the program is on the most common hedge fund withdrawal restrictions. A more detailed description is in the HedgeFundsRedemption.md file.
Another is a fork from jckantor of Python dateutil rule sets for NYSE trading days and holiday observances. The original rules are valid from the present onward. For backtesting or pattern recognition, however, you sometimes need NYSE trading days for the past several years. The rules were modified to produce NYSE trading days and holiday observances from 1986 onward.
A third works with a dataset stored in a CSV (comma-separated) file. The program reads the dataset into a dataframe and performs some operations on it. It can be used as a library or from the command line. The user can test the code from the command line using a simple language to define the operations. The purpose of this exercise is to demonstrate that, in the absence of a Pandas-like library in Java, for any advanced data processing job you are better off using Python and Pandas, even if it involves learning a new language. However, for a Java programmer who doesn't know Python and needs relatively simple dataset column operations, it might make sense to use something like this. The README file goes into more detail about the implementation and various limitations even for this tiny subset of Pandas functionality.
    Other contributions
Submitted a bug fix to Leo for its integration with aspell, a spell-checking C++ library. Several years ago Leo switched from aspell to PyEnchant.
Python, C++.

Submitted code patches to Apache Axis to improve AxisFault logging and provided a workaround.
Java.

Filed bug reports against Java Swing, JBoss, and several other Java tools and libraries.


C++


Below is a short description of some of the C++ projects that I was responsible for. The project-related information presented below is already in the public domain.

  • a wrapper around a client infrastructure library:
The infrastructure was rather complicated, as it was a company-wide infrastructure, and included a persistence layer, a shared cache, and a messaging component. The goal of a client infrastructure library was to hide most of the complexity from the users. This design not only made the infrastructure easier to use but also enabled it to be used with third-party libraries and legacy code. For more information see Presentations.
  • database design and a wrapper around db library
All db-related activities, including data and code, were my responsibility. We used the c-tree Plus library as a db engine embedded in the application process address space. All db calls were encapsulated in a library used by all members of the development team. The library was, in essence, an application-specific Object-to-Datastore Mapping tool. Later a SQL flavor was added to the library and all calls to c-tree Plus were converted to Sybase; in this case the library behaved like an application-specific ORM (object-relational mapping) tool. The transition of applications from c-tree Plus to Sybase was very smooth, and developers had a choice between an embedded db and a Sybase server.

Other C++ activities included participation in various C++ discussion groups, C++ consultations, and teaching C++. I also moderated the NYC Design Patterns Study Group and posted summaries of the discussions; most of the code samples were in C++.

After reading an early edition of Scott Meyers' book Effective C++, I sent an email to the author with my comments. The email started an extensive email correspondence with Scott Meyers; the correspondence resumed after the publication of his next book More Effective C++. Scott Meyers cited me in the following editions of his books Effective C++ and More Effective C++.

Below are some of the C++ libraries and tools that I was using. For memory management (in place of garbage collection), “smart pointers” were often used.

STL, Rogue Wave libraries, Orbix, CORBA, TIBCO, c-tree Plus


Java


My Java experience is evenly split between Core Java and J2EE.

In Core Java, I have used various Java packages and tools including collections, multithreading and concurrency ... A partial list of Java packages and libraries that I have used is at the end of this entry. Often a Java application that started as a single-threaded one will end up as a multi-threaded application; it helps that Java has very good concurrency support.

In J2EE, I have spent more time on the server side, but I have also done a fair amount of client-side development. On the client side, I have worked on desktop-based and browser-based clients. For desktop development, I have used Swing and experimented with JavaFX (Swing and JavaFX are technically part of Core Java). Browser-based clients also involved non-Java tools and languages like JavaScript, jQuery, HTML, and CSS. On the server side, I have used several application servers and libraries, a partial list of which is at the end of this entry.

I have also done a considerable amount of XML processing, data feed processing, web service creation and consumption, and web scraping.

While working with Java, I was involved in new applications development as well as working with legacy systems. Probably a separate entry is needed for a more detailed description of the projects.

Some samples of Java code are on GitHub Gist:

In one of them, the dataset is in a CSV (comma-separated) file. The program reads the dataset into a dataframe and performs some operations on it. It can be used as a library or from the command line. The user can test the code from the command line using a simple language to define the operations. The purpose of this exercise is to demonstrate that, in the absence of a Pandas-like library in Java, for any advanced data processing job you are better off using Python and Pandas, even if it involves learning a new language. However, for a Java programmer who doesn't know Python and needs relatively simple dataset column operations, it might make sense to use something like this. The README file goes into more detail about the implementation and various limitations even for this tiny subset of Pandas functionality.

Below are some of the libraries and tools that I was using:

J2SE/J2EE, JSP, Spring, Hibernate, EclipseLink, JPA, EJB, JMS, MQ, TIBCO, Twitter4J, Jsoup, Tomcat, JBoss, Glassfish, Log4j, Guice, JUnit, Mockito, jQuery, Apache Axis, JXPath, JAXB, Swing, Restlet, Ant, Maven.


Presentations


These presentations describe a wrapper around a client C++ infrastructure library. The infrastructure was rather complicated, as it was a company-wide infrastructure that included a persistence layer, a shared cache, and a messaging component. The goal of a client infrastructure library was to hide most of the complexity from the users. This design not only made the infrastructure easier to use but also enabled it to be used with third-party libraries and legacy code.

This design was greatly influenced by the STL adapter classes, the reference classes from the IBM collection class library, and the Orbix TIE approach. In the External Polymorphism Pattern this design is cited as an independent discovery of a variant of the pattern.

These two presentations describe the same work but from different angles. The UP'97 (international conference on Using Patterns) presentation is more detail-oriented, with more code samples, while the COOTS'97 (Conference on Object-Oriented Technologies) presentation is more design-oriented. The last sentence of the COOTS'97 presentation started heated debates; among the participants were Bjarne Stroustrup, Scott Meyers, Doug Lea, and Andrew Koenig.


COOTS'97   -     HTML     PDF

UP'97           -     HTML     PDF



Miscellaneous Development


This page is a temporary place for some of the topics, in no particular order, that I have worked on and that don't have their own blog entry yet, but will hopefully find a home soon.


Multi-threaded and distributed programming, concurrency. Java vs Python - multithreading and multiprocessing, GIL.

Messaging, Distributed cache - Redis, RQ, Celery, RabbitMQ, JMS, Tibco, IBM MQ, Oracle Coherence. On one project we have used Qpid for testing and IBM MQ for production.

RSS and Twitter feeds processing. Twitter4J, JAXB.

Science and Math; data structures and algorithms - Pandas, NumPy, SciPy.

Decision support system for railroad network management. Mathematical model describing the flow of railroad cars in a railroad region.

Lexical analyzer and parser development.

Sysadmin and dbadmin tasks. Internal Maven repository setup.

Troubleshooting and resolving production issues.

Working with legacy code, maintenance, enhancement, refactoring.

Other languages - Fortran, Assembler, Pascal.

Design, Design Patterns.

Publications - decision support systems, operating systems, numerical solutions of partial differential equations.

Taught programming and math as a part-time job.

Participated in different Python, Java, and C++ forums and mailing lists. Moderated The NYC Design Patterns Study Group and posted summaries of the discussions.


Technical Writing


Technical writing is sometimes defined as simplifying the complex (techwhirl.com), the same goal that I am trying to achieve in my code. There is a saying that explaining helps you better understand what you are doing, and in this sense writing documentation often helps me better understand the code.

If the code has any value, people will read and modify it. Any documentation that clarifies the code will be helpful, even when reading your own code many years later.

The same could be said about user guides, which help the users of a tool or a library and spare them from reading the code in order to use it. One example is a J2EE project where I was responsible for the migration to Maven. Part of my job was setting up the project structure, including the build and deployment, and also setting up an internal Maven repository. Writing the user guides had an extra benefit: members of the team were able to use Maven without asking me a lot of Maven-related questions.

Most of my technical writing was part of the internal documentation and isn't in the public domain. What is on this site is in the public domain, as well as my open source contributions that include documentation.

Even though technical writing was a relatively small part of my job and not a full-time occupation, I took several academic writing courses at Columbia University and passed the level 10 exam; my teachers liked my writing and encouraged me. I really enjoy the creative process of writing.

Below are some of the tools that I am using:

reStructuredText, Markdown, Sphinx, wiki.


Testing


I have done functional and unit testing. There was even a period in my career when testing was my main responsibility, but this was some time ago.

Ideally, you want to test everything. But there is a lot of prototyping and experimenting, especially in Python, and writing tests for throwaway code looks like a waste of time. There are also deadlines, and testing is often their first casualty. One place where testing is rigorously enforced is open-source development, and that is definitely the case with Pandas. Any Pandas contribution should have a corresponding test case, and I learned a lot about Python testing as a result of my Pandas contributions.

In C++ I wrote a client/server Test Bed for system testing of infrastructure. That was the time when testing was my main responsibility.

In Java and Python, I wrote a fair amount of unit tests, but they were mostly run-of-the-mill unit tests, similar to what you can find online.

Below are some of the libraries and tools that I was using:

pytest, unittest, mocking, Selenium, JUnit, Mockito, PowerMock.


Ant Script to Update/Install Eclipse


The script, which was written several years ago and hosted on this site, is now on GitHub.

At the time the script was written, Eclipse was relatively new and, as a new platform, was lacking some important features that were left to plugin writers. Many Eclipse users were waiting anxiously for a new Eclipse build or a new version of a popular Eclipse plugin.

However, after a new version arrived, updating was not very straightforward, as there were too many rules to follow and the Eclipse documentation was not as good as it is now. Plugins were usually downloaded and installed manually, and plugin writers used different conventions to package their plugins. Occasionally a new version was incompatible with the old one in such a way that an update could make your workspace unusable. The script addressed all these issues and was easily customizable; sometimes you just need a command-line tool.

It is difficult to say for sure how widely this script has been used. At the time Apache Ant was relatively new and was used as a build tool, a replacement for the UNIX make utility. Some indication of the script's usage comes from the emails I received from users with questions, feature requests, suggestions, and code contributions. The script was reviewed in several blogs and newsgroups, and googling at the time for "Ant script" produced results with this script having the highest rank.

The following text was written several years ago, at the same time as the script; even if some statements might look slightly dated, I have decided to leave the original text intact to preserve the spirit of the time:

Eclipse is a great IDE. It is relatively new and the speed with which Eclipse team introduces new features is amazing. However, my first update to a new Eclipse build was rather time consuming; and from reading news I realized that other developers were struggling with updates too. The script simplified updates and, over time, evolved to incorporate some other related activities. All the documentation is in the Readme file. One chapter in the Readme file, named What is the Right Way to Update?, is different from the other chapters as this chapter is not about how to use the script but rather why and when to use it.

You can download a zip file from this site or just browse Readme file online. Latest changes are in Release Notes. It is worth noting that Eclipse comes with the Update Manager but it seems to be for major releases and not for builds. The primary audience for this script is developers who update Eclipse quite often.

Read what others think:

http://youarenumber6.blogspot.com/2004/08/departmental-eclipse-with-ant.html

http://www.jroller.com/page/dorodok/20030106

