I’m trying to solve one of recent Kaggle competitions: “ICDM 2015: Drawbridge Cross-Device Connections“. That competition provides data on device/browser usage and asks you to determine which cookies belong to an individual using a device.
The data for this competition is available in two formats: CSV files and SQLite database. A relational database looks more suitable for ad-hoc queries because SQL is a quite powerful tool: you can easily join tables, filter and group data. Though it lacks some of statistical analysis capabilities which you can get in R or other tools specialized in statistics.
SQLite is an awesome technology for small embedded databases, but there are certainly no good GUI applications for querying SQLite databases. I also was afraid that SQLite query execution engine is not very smart, and SQLite dialect of SQL is not rich enough, in comparison to SQL Server or Oracle, so I decided to import SQLite database into Microsoft SQL Server 2012.
The first thing to access SQLite database on Windows is to install ODBC driver. I installed this one: http://www.ch-werner.de/sqliteodbc/
An obvious way to import data into SQL Server is to use SQL Server Import and Expoert Data wizard which uses SQL Server Integration Services internally. Unfortunately, when I tried to use it to import 15 gigabytes of data, it consumed all available RAM and froze the machine. I don’t know if I did anything wrong. There could be settings to make it consume less RAM. Or I could try another ODBC driver.
Another option I tried was to use OPENROWSET T-SQL function to access external data source from SQL Server query. The idea is to run an INSERT INTO statement which selects rows directly from SQLite database.
Prior to transferring the data, I created tables with the same structure as in SQLIte database. SQLite doesn’t have limits on string field length, so I created tables using VARCHAR(MAX) type for all columns. It was a mistake: SQL Server doesn’t support indexes for columns of this type. It is better to figure out max length and create fields of type VARCHAR(length) .
By default SQL Server 2012 doesn’t allow ad-hoc queries with OPENROWSET call. To enable it I executed these commands (see https://msdn.microsoft.com/en-us/library/ms187569.aspx for details):
EXEC sp_configure 'show advanced options', 1; RECONFIGURE; EXEC sp_configure 'Ad Hoc Distributed Queries', 1; RECONFIGURE;
The idea is to run an INSERT INTO statement which selects rows directly from SQLite database:
INSERT INTO cookies (drawbridge_handle, cookie_id, ...) SELECT drawbridge_handle, cookie_id, ... FROM OPENROWSET('MSDASQL', 'DSN=ICDM2015', 'SELECT drawbridge_handle, cookie_id, ... FROM cookies')
As copying millions of rows in one statement can be hard, I used small batches: import first 100,000 rows, then next 100,000 rows, until everything is copied. OPENROWSET requires to pass a constant value as a SQL expression to execute, therefore to pass @StartRowNumber and @BatchSize we have to execute dynamic SQL. Final SQL looks like this:
DECLARE @BatchSize INT = 100000 DECLARE @StartRowNumber INT = 0 DECLARE @RowCount INT = 1 WHILE(@RowCount > 0) BEGIN DECLARE @SQL NVARCHAR(MAX) = 'INSERT INTO cookies (drawbridge_handle, cookie_id, computer_os_type, browser_version, country, anonymous_c0, anonymous_c1, anonymous_c2, anonymous_5, anonymous_6, anonymous_7) SELECT drawbridge_handle, cookie_id, computer_os_type, browser_version, country, anonymous_c0, anonymous_c1, anonymous_c2, anonymous_5, anonymous_6, anonymous_7 FROM OPENROWSET(''MSDASQL'', ''DSN=ICDM2015'', ''SELECT * FROM cookies ORDER BY cookie_id LIMIT ' + CONVERT(NVARCHAR(20), @StartRowNumber) + ',' + CONVERT(NVARCHAR(20), @BatchSize) + ''') as F' EXEC sp_executesql @SQL SET @RowCount = @@ROWCOUNT SET @StartRowNumber = @StartRowNumber + @BatchSize END
It works fine, though not fast. Took a few hours to copy all 15 gigabytes of data. Then I created a few indexes to support execution of my ad-hoc queries.