Phelio Gnomi

My words from my mind

From Unicorn to Unicode

What is worse than knowing that Unicorn exists in some other dimensions but you will never be able to see it?

My answer will be the xA0 character from some encoding world that I don’t even know to exist. Being an Earthling, the only encoding world I’ve been and known is the Unicode. More specifically the UTF-8 realm.

Interestingly, many Unicode based systems reject the xA0 (or any nonconvertible characters) and totally crashes the system. Take Python for example, and also PostgreSQL later on.

Python

In Python, there is a function call unicode() that convert a string from other encoding to Unicode.

unicode(object[encoding[errors]])

However, the “errors” handling is defaulted to “strict”. It means that it will complain that “Something is wrong” whenever there is an error. Basically it means that it will break the system when there is an untranslatable character in the object that you are trying to convert.

There are two other options in handling conversion errors.

  • “replace” to replace the untranslatable character to the official Unicode replacement character
  • “ignore” basically replace the untranslatable character with an empty string.

PostgreSQL

When inserting non Unicode strings into an UTF-8 (Unicode based) databases, PostgreSQL will try to translate them first. Same thing will happen if the said string contain an untranslatable character, it will throw you an error.

This can be a hell of a problem because it technically break your system if your system is a one of those systems that process input and save them into a database.

So the solution is usually to try to catch these unicorns before they escaped into the database.

The adventure of the Old Mac line breaker in the Python world

There are many representations of a new line, End Of Line indicator, or a line breaker. You probably heard of the terms Line Feed (LF) and Carriage Return (CR). They are technically characters like the letter “A” and small letter “a”. But instead of printing the letter, they tell the system that it’s the end of a line. However, different computer system uses these 2 common characters in different ways but let’s narrow it down into the 2 most common ones, namely the Unix version “LF” and the Windows version “CR+LF”. But wait a minute, there is this Old Mac version as well that uses only CR character to represent the end of line.

Interestingly in the Python’s universe (and probably some other even more racist universes), the Old Mac convention is by default not a line breaker. If you read a file full of lines that only ends with “CR” using the standard file open() function in Python, they will come out as a single line text.

As a slightly less racist developer, we need to build applications that can support as many types of stuff as possible. Here are 2 tricks to help you ensure the file you are reading is read properly the next time you use it.

When reading a file

# Use the 'rU' mode so it understand the Old Mac properly
file = open('filename', 'rU')

If you happen to be working with File upload in Django, this might be useful

# http://stackoverflow.com/questions/1875956/how-can-i-access-an-uploaded-file-in-universal-newline-mode

# First, read the uploaded file and convert it to unicode using unicode() function
# Second, stream the file using io.StringIO function with the Universal-newline mode turn on by setting newline=None
import io
stream = io.StringIO(unicode(request.FILES['foo'].read()), newline=None)

Slicker way to export crosstab in Tableau

Tableau has an awesome feature to allow users to export the processed data into crosstab, or I would say Comma Separated Value format.

We use Tableau to help us visualize data into graphs and charts. And our raw data is often messy and huge and doesn’t really make sense in that granular state. So the charts and graphs are really much better to look at. And even though some users are happy enough by looking at charts, some others find it important that we can get the post-calculated raw date, i.e. the numeric equivalent of the given chart.

Image

But when it comes to a grander scale of a few thousands rows of records, The Export Crosstab to Excel just doesn’t work well. It took a very long time to work, and that if it works at all.

So, instead of exporting it to Excel, use Copy -> Crosstab function. This function is surprisingly much faster than exporting to Excel.

Image

As a comparison, exporting 50,000 rows to Excel will take forever, and never but copying the same 50,000 rows to clipboard only takes 5 seconds.

The Magical Tableau Filtering and how to get the Tableau result limiting right

NOTE: this tutorial is based on the coffee shop sample data came with tableau desktop.

The other days long time ago, we all thought that there is a simple way to limit the results of a tableau report. Even though it’s a bit of an unwritten magic, but the legend stays true that it’s quite simple indeed.

Goal

Say that you have a coffee shop chain that is doing so well it has branches in all 4 corners of the imaginary planet of Zox. Now you have no other thing to do, you decided to look at the ups and downs of your profit throughout the year. So you decided to make a chart that looks something like this

Image

 

Now you can see how much sales your coffee shop is making along the year in a very nice chart. But it’s too much information on the chart, so you decided to just show the top 3 products based on its sales.

To do that, you pulled the product field from the Dimensions field box into the Filters box on the top left corner. A dialog window will pop up and navigate to the “Top” tab. You set it to only show the top 3 products based on the total sales of that product.

Image

Image

Problem

All looks good so far. After looking at the chart for like 10 seconds, you got bored again. So you decided to add more fun to the chart. “How about getting the top 3 products of each product type you have, Coffee, Espresso, Herbal Tea and Tea” you thought, then you proceed to pull the Product Type field into the Filters box and make the selection box appear on the right side of the chart.

Image

(Right click on the Product Type filter and click “Show Quick Filter”)

and tada… everything looks wrong now. It only shows 1 product, instead of Top 3 products as you’ve commanded it to show.

Image

 

Solution

So you play around with the filter on the right and figured that it might be the culprit. You looked back at the pictures that you’ve screenshotted and noticed something that might be useful, the “Add to Context” option on the right click menu.

Image

You do that to the Product Type filter and voila!! it works!!

Image

Conclusion

Tableau filter by default doesn’t apply to the “Top x” filter. By adding a filter to context, it’s specifically telling tableau that the filter is to be taken contextually. Without the “add to context” option, the “Top x” will get the top x across the product type, as if the filter is not there at all.

This problem had costed me a million seconds to fix, but I hope that this short article will save some of your time.

Alteryx encoding

Have you ever get so fed up because there is no such thing as “Encoding” tool in the vast collection of Alteryx tools and you are getting a non “CSV” export file filled with unreadable characters in it? Well, You’ve came to the right place.

Alteryx uses the term “code pages” to describe encoding, or UTF-8 or Character Sets setting. What the hack is code page anyway? can’t you just say “Encoding” or “Character Set” so it’s more commonly used nowadays?

Anyway, here is the solution. First, search the Alteryx help index for “code page” and it will list you a list of encoding standards with its respective code.

Image

Now, comes the biggest spoiler, the encoding tool is hidden inside the Formula tool as a “function” and named ConvertToCodePage and ConvertFromCodePage.

Image

 

 

 

How I forced myself to spend time on a side project

I found it so much easier for me to be productive when I’m at office, forced to work from 9-5 every Monday to Friday than to working at home on a side project where I don’t really have a fixed working hour.

Why is it so hard?

Few differences that I observe to be different between office and home and how I fix them:

You are mostly working alone at home.

Most of the side projects I’ve done is a lone project. Somebody (or me) started it and a developer is needed and I volunteer and boom, you are all alone. It could be a good thing for some people, but I find it very demotivating.

Well, most projects failed badly and I blame the loneliness that I felt during the project. I felt that I needed approval from another peer. I needed somebody else to say that, “Hey, it’s awesome”. Or, “hey, you can do this better by doing this”. You have peers probably scrutinizing your work all the time at office. But when you don’t get the scrutiny, you don’t feel confident.

But hey, again, you are the master of everything now. It’s your own project and you know best about everything. I need to constantly remind myself that “I’m a good programmer” and “it’s OK to make mistakes”. It’s easier to say than done, of course. But, it’s really normal to fail. Failure is how we learn to be better.

Another thing that might help is the online community. Go to forums, stackoverflow, or github. Share your works, and get feedback or get some peers to help you do the things that you are working on. It’s kind of popular right now, isn’t it?

You don’t have a fixed time at home.

When you are at home, you start working when you feel like working. But there is always a problem to start working. I found that the problem with me is that I am lazy. And the reason of why I’m lazy is because the amount of time that I need to spend working is infinity. Unlike at office, you have a finite working hour everyday. You know that you just need to work until 5 o’clock and after that you are free. And that kind of make you happy a little bit.

Do the same thing at home. Set the amount of time you are willing to spend on your side project. For me, the easiest way to do so is by setting a timer. In fact, I have a timer running right now as I’m writing this. Set it to one hour each, don’t make it too long. Just enough for you to finish a small task.

So many distractions at home.

You got so much freedom at home that you probably won’t start doing anything. That’s certainly not true. I’m supposed to be working on my side project right now, but instead, I’m writing this article. Let it be. It’s something productive anyway. Don’t feel bad about yourself. Do things that you enjoy doing. If the project never got finished, at least you’ve got some fun time doing something else. Don’t to be too harsh on your life. There is time for playing, and there is time for working. When the time for working comes, you will find yourself working.

You don’t directly get paid when working on a side project.

Well, this is probably the hardest part. Money, either directly or indirectly, is probably the greatest motivation of all. It’s really hard to do something unless you are certain that it will give you something back in return. And that something is normally money. But, for personal project, you are normally getting the money at the very later stage of the project. Probably after you’ve spend thousands of hours and bloods (and money).

That’s why you need to swift your motivation a little bit. Start by thinking that money is not what you are aiming for. Let the project be a “just for fun” project without shutting its prospect to shine and become a gold digger. In other words, only do project that you enjoy doing unless you get paid upfront.

The art of writing codes

I’m supposed to be fixing a bug in my Expressionengine module right now. But, instead of that, I feel like procrastinating a little bit and write about fixing bugs. Not exactly just about fixing bugs, but a lot of other things about writing codes. But well, I think I didn’t mean to write about bugs, but since it’s been written this way, I will just leave it there. I will write about fixing bugs next time.

This time, I would like to write about writing. Writing is a very old technology. Writing is said to be the requirement of a civilization to success. And now, we have come to a point of time where writing is very important for every single person on Earth (I’m not really sure about other planets, though).

Holy crap, this is going to be a very boring article.

There are probably tons of articles on how to write English well. My question is, can we apply these human language principles in programming languages? why not?

Writing codes for computers has evolved from writing symbols that only computers can understand (or probably binary? I don’t know) to the variety of so many high level programming languages where it’s more for human to read rather than the machines. But the problem is, there is still a very big gap between what a computer can understand and what a human can read. And I have to say that it’s easier to write for the machine than for human to read.

My job for the past few weeks are basically reading codes. I would like to call my self a software archaeologist.  Wait a moment, some other dude already used that term before? fine…

Reading people’s code is time consuming, stressful, difficult, and a very complicated job. Programmers love to complain about having to debug some buggy codes or  being in the maintenance mode without being able to write new codes from scratch. But, let’s read it like a book. There are stories to unfold, mystery to solve. There are characters like functions, objects, variables, classes. the more we read, the more we understand them. The emotional connection between the readers and the fictional characters grown.

What if we write codes like we write books. What I mean here is that we take human as our main audience and focus on delivering information that are intended for human to read. That’s probably what we have been doing all this time by writing comments in our codes. But, writing comments are just part of the task here. I believe we can do much more than writing comments. We can even write to a point where we don’t really need to write any comment anymore.

Proper naming of your characters

A good writer doesn’t roll a thousand faced dice filled with human names and set their characters based on the names that show up. Most writer pick a name because it has something to do with the story. Darth Vader actually means Dark Father where he turn out to be (Spoiler alert!!) Luke’s fathers.

We name our everything in our codes. We are the god of that universe. But that universe is not yours alone. So, name them for the purpose of clarity. We want the readers to understand what a function does without having to rip open that function. In other word, generate_random_name() should generate a random name and kill_your_self() will actually kill your self.

I know it’s difficult sometimes to pick a proper name. But, at least think about the future human readers when you are naming them.

Human readable algorithm

Once you’ve got the naming right, it’s easier to write human readable algorithms.

if andy.isStupid:
  andy.go_to_school()

But, there are also some other things to consider. There are always more than one way to achieve a goal in coding. The kind of codes I would like to read are the ones that are easiest to read without compromising performance. At the end, speed is still the most important thing for your boss.

Spaces

Put a lot of them. Put a line break on every groups of codes. Make it clear that this chunk of codes are family while its neighbor down there is something else. Remember, you are writing for human to read, not the machine.

Consistency

Last, and the most important one IMHO, is make it consistent. Make every style in your codes mean something. If you like camelcase, use camelcase. I sometimes use camelsase and under_score for a different purpose. camelCase for variables and under_score for functions.

It’s never about right or wrong. It’s about trust. If camelCase means functions, and suddenly I see camelCase which is not a function, I lose that trust on you. I can’t trust you anymore and it makes the whole program harder to read.

From rwx to 777

Have you ever come to a situation where you are so new in town and you don’t know how to blend in? You kind of see everybody as a codes that’s made up of rwx rwx rwx but you are speaking in the language of numbers like 667 or 557.

file premission

Now that you need to be like everyone else, I’ll teach you how to extract the numbers from those “rwx”es. First, you need to catch one of them, bring them into a dark room called stat. Once he’s inside the stat room, he will speak out many other languages, including the number language that you need to know.

stat something.a

code

 

Now you know their number. You know what to do from there.

If you still don’t know, here is the magic words.

chmod <the magic number> <yourname>

For example:

chmod 0776 you

Speeding up your query with Temp Tables

If you have a query that:

  1. have a lot of subqueries
  2. join many different tables or subqueries
  3. going to return a huge amount of data
  4. take forever to run (more than 1 minutes)
  5. you have a busy database server

You can run your query in a hundred times faster speed by storing some groups of subqueries results into a temp table first before running the final joins using these temp tables.

This way, you will have:

  1. a smaller sets of queries to run
  2. which will reduce the memory usage
  3. and reduce the chance of blocking other running queries
  4. and it runs a hell lot faster.

Let’s make a super simple example.

--Old and Slow
--let's assume that huge_table_a and huge_table_b both contain 1 billion records each
SELECT tbl_a.item,
       tbl_a.profit_share,
       tbl_b.profit_share,
FROM   (SELECT item,
               Sum(profit) / Max(profit) AS profit_share
        FROM   (SELECT item,
                       profit,
                       UNNEST(regions) AS region
                FROM   huge_table_a
                WHERE  region = "south")
        GROUP  BY item) tbl_a
       JOIN (SELECT item,
                    Sum(profit) / Max(profit) AS profit_share
             FROM   (SELECT item,
                            profit,
                            UNNEST(regions) AS region
                     FROM   huge_table_a
                     WHERE  region = "north")
             GROUP  BY item) tbl_b
         ON tbl_a.item = tbl_b.item
         ORDER BY tbl_a.profit_share DESC;

Using Temp Tables

-- Will run a lot faster especially in a busy DB
CREATE TEMP TABLE tmp_table1 AS (
       SELECT item,
               Sum(profit) / Max(profit) AS profit_share
        FROM   (SELECT item,
                       profit,
                       UNNEST(regions) AS region
                FROM   huge_table_a
                WHERE  region = "south")
        GROUP  BY item
);

CREATE TEMP TABLE tmp_table2 AS (
       SELECT item,
               Sum(profit) / Max(profit) AS profit_share
        FROM   (SELECT item,
                       profit,
                       UNNEST(regions) AS region
                FROM   huge_table_a
                WHERE  region = "south")
        GROUP  BY item
);

SELECT tbl_a.item,
       tbl_a.profit_share,
       tbl_b.profit_share,
FROM   tmp_table1 tbl_a
       JOIN tmp_table2 tbl_b
         ON tbl_a.item = tbl_b.item
         ORDER BY tbl_a.profit_share DESC;

 

Better way to execute sql files in PostgreSQL

My previous post  is kind of dumb. It’s simple, but dumb. First of all, the /o command will only write the output of a particular query into file. It won’t write any error or any other information that pop out during the execution. The last time I ran that, it failed and I don’t know what happened or where it stop.

So, a better way to do it is by running psql with -f option.

psql -h HOSTNAME -U USERNAME -d DBNAME -a -f FILENAME > LOGFILENAME 2>&1

A little explanation here:

  • -a tells the psql to print out all the input commands to Standard Output. It basically print out all the queries inside your SQL file. This is handy so that you know what went wrong and where did it stop.
  • -f FILENAME tells the psql to run the given file
  • > LOGFILENAM is the basic UNIX command to redirect the standard output to a file. In other words, it says that whatever output that you are supposed to see in psql will be written into a file instead.
  • 2>&1 is very important because it asks the system to also print any errors into the standard output. Since we’ve redirected that standard output into a file, it means that any error will be logged into that file also.

For more information, checkout the psql manual and also the Bash redirection.

Follow

Get every new post delivered to your Inbox.