When dealing with geospatial data it is sometimes useful to have a grid at hand that represents the given data. One way to create a grid like this is to use GeoHashes. GeoHashes are a hierarchical spatial data structure which subdivides space into buckets of grid shape, which is one of the many applications of what is known as a Z-order curve, and generally space-filling curves. A geohash is an encoded character string that is computed from geographic coordinates.

Continue reading

On October 5th the PostgreSQL Global Development Group announced the release of PostgreSQL 10. It comes with tremendous amount of new features like Table partitioning Logical replication Improved parallel queries Stronger password hashing Durable Hash Indexes and more. A nice list, including explanations can be found on Robert Haas’ blog. This post explains how to upgrade to the latest version of PostgreSQL on macOS using Homebrew. At the time of this writing I was using macOS 10.

Continue reading

After three great days at the PyCon US 2017 in Portland, OR Hendrik and I decided to participate in the development sprints succeeding the conferece. The code sprints are an essential part of PyCon, and a chance to meet some of the maintainers and contributors of various open source projects. For us it was the first time attending a code sprint. The day before the sprint there was a session helping people to set up Git, Python (including virtual environments) and getting familiar with version control.

Continue reading

Recently, I set up Jupyter Notebooks on a server at work. The idea was to create an enviroment where every team member could run analyses using Python and share the results with the rest. After reading the documentation, I found out that the Jupyter Notebook web application comes with a Contents API I quickly put together a little Munin script that collects some statistics about the current notebooks. The graph shows the total number of notebooks on the server as well as the currently open notebooks:

Continue reading

There are a lot of cases when we want to track time when an entity was created or updated. Here is a simple recipe to make some or all of your SQLAlchemy entities auto-timestamping. To achieve this, we will provide a mixin class. from datetime import datetime from sqlalchemy import Column, DateTime, event class TimeStampMixin(object): """ Timestamping mixin """ created_at = Column(DateTime, default=datetime.utcnow) created_at._creation_order = 9998 updated_at = Column(DateTime, default=datetime.utcnow) updated_at.

Continue reading

Everyday at work around noon the question of where to get lunch comes up. Normally, we choose between different restataurants in the vicinity of the office. One exception is the HU Mensa (university cafeteria). Despite being really cheap the food quality there varies a lot and it really depends on the daily menu whether a visit is worthwhile. To tackle this issue I decided to spent another IT Open Space putting together a little script that will help us in the future.

Continue reading

At Project-A we are using Codebase as a project management tool together with its version control. Just as with any other tools you can create tickets and organize them in sprints. Our usual (very simplified) workflow includes: Sprint planning for tickets Priotizing tickets Developer working on tickets Product managers verifying if the tickets were implemented as intended Unfortunately, sometimes your backlog keeps growing and tickets are no longer valid, outdated or, in the worst case, just forgotten.

Continue reading

An ETL import graph is build on logical dependencies of the jobs to each other. So typically a SQL transformation job depends on all the previous jobs that create the tables used in the query. But once there are a certain number of jobs, dependencies often get a bit more complicated and some of them become redundant in the process. A simple example can be seen in the dependency graph from figure, where the three red edges are redundant.

Continue reading

To speed up the ETL data pipeline, you should try to run jobs in parallel. Obviously, not all jobs can run at the same time in most cases, since there are dependency constraints between the jobs and limits of the servers capacity (number of processors and/or IO bandwidth). So assuming the server allows you to run n jobs in parallel, often there is the situation that the dependencies give you the option to run any of a set of m different jobs with m > n.

Continue reading

Once you have set-up a web server like Apache or nginx running on the Raspberry Pi it is time to create a website. From here there a several options: A CMS that relies on a database, some purely manual crafted pages or a static pages generated by a script. I chose the latter for some reasons. Static sites have a lot of advantages: no database to slow requests down offer greater security, as they do not contain dynamic content, so are immune to the most common attacks flat, text files, makes them ideal to be used with version control systems, such as Git low footprint on the server as serving raw html files But there also some limitations:

Continue reading

Author's picture

Christian Stade-Schuldt

Data Engineer @ HERE IoT innovation lab| Full-time geek | Cyclist | Learning from data

Data Engineer @ HERE

Berlin, Germany