Text Deduplication in SQL

Data deduplication is essential when importing similar data from different sources. Different providers store data differently, and several variations (both correct and incorrect) exist in the English language for names of people, companies, and entities in general. Deduplication is often made easier if there is a lot of other information associated with the data because it gives you several things to compare to identify a dupe (such as birthday for people, location for company, etc. Read On →

Mapping the Snapchat Data Leak

I’ve been following the Snapchat data leak pretty closely the past few weeks, from the announced weakness to the actual leak of the phone numbers. What I found most interesting about this in particular was that instead of email addresses, password hashes, or credit cards, the leaked data was geographical, mappable data. Digging into what actually leaked, I saw some analysis that suggests that only a fraction of area codes were represented in the leak. Read On →

Introducing Tweet Zone

About a month ago a Washington Capitals blog called JapersRink tweeted this out: How is it that we're in November 2013 and there isn't a Twitter account that just tweets out every single NHL goal when it's scored? — JapersRink (@JapersRink) November 16, 2013 Good question. Having worked with Twitter’s API, I knew that this would be fairly easy to accomplish if only I had a source for the score information. Read On →

Parsing JSON Array With Missing Elements

A project I’m working on requires me to parse JSON arrays from strings that I do not have control over. One of arrays I see frequently has empty elements at certain positions in the array. This normally isn’t a problem if done correctly, but in this case it is not done correctly. ["Thu","8:25","Info",,"Entity 1","17","Entity 2","13",,,"55999",,"November","2013"] If you try to interpret this in Javascript, you’ll get a syntax error at the fourth element when it encounters the second comma. Read On →

Hashing With SQL Server CLR

I have been looking at using hashes in a computed column to determine equality among rows, rather than compare each column. While running some tests, I encountered a limitation with SQL Server’s HASHBYTES function: the input can only be 8000 bytes or smaller. This won’t work for our purposes, as some of our tables have NVARCHAR(MAX) columns whose maximum length exceeds 8000 bytes. One solution I’m looking into is using a CLR. Read On →

Moving Files in Android

Android users have the luxury of being able to access the storage on their devices. Because of this, many prefer to maintain their own directory structure and organization pattern for their files. A frequently requested feature for Vibe Vault is the ability to change the directory to which we download files. Many apps unfortunately don’t allow users to change where files are stored, and I suspect it’s either because developers hard-code path variables, or they don’t want to risk moving around existing files and directories. Read On →

Moving a File In The Android MediaStore

When you move or copy a media file in Android, any app that access it via the MediaStore will not automatically be updated. In one of my previous articles I talked about querying and manipulating Android’s MediaStore, and the cost of doing a full scan of the file system to check for changes. Luckily, there’s an easier way to update the file’s location: use ContentResolver’s update() method. ContentValues values = new ContentValues(); values. Read On →

Pagination Syntax in SQL Server 2012

When we started upgrading our SQL Server 2008 instances to 2012, I went back and reviewed the new development features that were added. There were many that I was excited to try out and find what performance and readability improvements I would experience. Among those was the pagination enhancements to the ORDER BY clause which allow you to specify an offset and number of rows you want. MySQL developers have long used the convenient LIMIT statement, but SQL Server developers have had to use subqueries or CTEs with ranking functions to achieve the same effect. Read On →

The Case Against 'OR' in MS SQL Server

When writing SQL queries, I try to use logic that is intuitive to myself and to someone else who may read my code in the future. While this is good practice, it is equally important to consider the performance of what you write. I often see queries where the author crams as much logic into a single query as possible, presuming a smaller and less complicated query will make for a more efficient query. Read On →

Queries in SELECT statements vs APPLY in MS SQL Server

There are many times working with SQL when you want to select just one particular row from a table which has many that satisfy a join condition. This could be when you want the latest record, or just one name from a large set of valid names. I’ve personally encountered this when writing BI reports. For this example, I’m using a table called ‘Entities’ and a table called ‘Names’, which is referenced by the column Entities. Read On →