Tuesday, December 15, 2015

Migrating AppEngine Datastore Entities to a SQL Database: Jenkins Automation

In two previous posts I laid out the process for extracting the entities from the Google AppEngine datastore and then loading them into Azure SQL.  While that process was successful, it was also a bit labor-intensive.  I felt pretty certain it could be automated, so I set out to script it and run it from Jenkins.

My initial goal was to complete the entire process, from initiating the datastore backup to pushing to Azure, without modifying the existing app in any way.  So, while the most straightforward way to back up the entities periodically is with a (shocker) cron job, I wanted to try to trigger the backup from outside the app.


Triggering the Backup


I knew I was going to have to obtain an OAuth2 access token somehow, in a way that was reproducible in a script.  I found this article on Martin Fowler's site outlining how to get the necessary tokens via the command line.  I followed his process, using curl from the command line and simplifying a few of his steps, and was able to get refresh and access tokens.  Sweet! Now I just plug in the URL for triggering the backup... and... no dice.
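
The token dance here boils down to a couple of curl calls against Google's OAuth2 token endpoint.  Roughly (placeholders in braces; the redirect URI is the out-of-band one used for installed apps):

curl -d "code={auth code}&client_id={client id}&client_secret={client secret}&redirect_uri=urn:ietf:wg:oauth:2.0:oob&grant_type=authorization_code" https://accounts.google.com/o/oauth2/token

curl -d "refresh_token={refresh token}&client_id={client id}&client_secret={client secret}&grant_type=refresh_token" https://accounts.google.com/o/oauth2/token

The first call trades the one-time authorization code (obtained in the browser) for refresh and access tokens; the second can be rerun from a script whenever a fresh access token is needed.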

No matter what I did, I couldn't get it to trigger the backup; I kept getting 403 errors.  Best I can figure, I would have to create an access token using the parent project's service account.  This is possible via HTTP, but it's a huge, error-prone hassle (you have to create and sign a JSON Web Token... no thanks!).  I probably could have written a Java program to create an access token from a .p12 file (I did this for my paperless timesheet application), but the benefit of avoiding a cron job was diminishing rapidly.

I resigned myself to adding a cron job and moved on.
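
For completeness, the documented way to schedule the backup is a cron.yaml entry pointing at the Datastore Admin backup handler.  Something along these lines (the schedule here is just an example; the kinds and bucket match the ones used in the scripts later):

cron:
- description: nightly datastore backup
  url: /_ah/datastore_admin/backup.create?name=auto_backup&kind=Ticket&kind=HtsTicketAssignment&kind=UserProfile&kind=OrgGroup&filesystem=gs&gs_bucket_name=wyoorgdev_backup_automation
  schedule: every day 02:00
  target: ah-builtin-python-bundle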


Script: Importing Entities into BigQuery


The Google Cloud SDK includes command line tools for BigQuery (bq) and Cloud Storage (gsutil). The command to load data from a datastore backup into a BigQuery table looks like this:

bq {flags} load {options} destination source

For flags I need to specify the project ID (I don't want to rely on an environment variable).  For options, I need both "replace" and "source_format".  The "replace" option makes housekeeping easier, since it will automatically overwrite the table if it is already there; otherwise I would have to explicitly delete the existing table.  The "source_format" option specifies that we are loading a datastore backup.  The destination is the BigQuery table, and the source is the .backup_info file for the entity kind we are loading.  Expanded, it looks a bit like this:

bq --project_id={project id} load --replace --source_format=DATASTORE_BACKUP
      {table name} gs://{bucket}/{filename}.KIND.backup_info

One of the first hurdles I had to overcome was the matter of reliably getting the current backup_info file.  While gsutil has an ls -l command that lists the details of the files in the bucket, there isn't an easy way to directly sort the results by date.  After a bit of fiddling, I was able to trim the extraneous data from the listing and reverse sort it on the date information.  

I used temporary files to carry the data from one step of the process to the next, but it probably could have been done with variables or nested loops as well.  Here is the complete script:

@ECHO OFF
%=----------------------------------=%
%=  LOAD THE DATA INTO BIGQUERY     =%
%=----------------------------------=%
for %%z in (Ticket HtsTicketAssignment UserProfile OrgGroup) do (
    %=  run gsutil once to clear update notifications, then run again to get file list =%
    call gsutil > NUL
    call gsutil ls -l gs://wyoorgdev_backup_automation/*.%%z* > List.txt

    %=  Strip "TOTAL" line from input =%
    setlocal EnableDelayedExpansion
    for /f "tokens=1-3delims= " %%a in (List.txt) do  ( 
        if %%a NEQ TOTAL: (
            echo %%b %%c >> Trimmed.txt
        ) 
    )
    sort -r Trimmed.txt -o Sorted.txt
    for /f "tokens=2delims= " %%i in ('head -n 1 Sorted.txt') do (
        call bq --project_id=wyoorgdev load --replace ^
                --source_format=DATASTORE_BACKUP automation.%%z %%i
    ) 
    rm Trimmed.txt
    rm Sorted.txt
    rm List.txt
)

I'm running gsutil twice, the first time with no other options and with output redirected to NUL.  The reason is that quite often the Cloud SDK utilities would nag about updating to the newest version, and since I'm saving the ls output to a file, I didn't want that extra text muddying up the parsing later.  Running gsutil once clears the nag (it only appears the first time), so the second run produces clean output.  There is probably a setting somewhere to turn that off, but this works for now.
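
For what it's worth, the Cloud SDK does have a config property that looks like it would suppress the update check; something like this should do it, though I haven't tried it here:

call gcloud config set component_manager/disable_update_check true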

The first for loop iterates through the kinds I'm interested in exporting.  This value determines which .backup_info file to use and is later used to name the BigQuery table that will hold the exported values.  The results of gsutil ls -l are saved in List.txt; this includes the TOTAL line as well as the individual file listings.


The next for loop tokenizes each line and echoes the second and third tokens of each line not beginning with "TOTAL" into a second text file called "Trimmed.txt".  The text in this file is just the dates and filenames.  The date format is such that lexicographical order is also temporal order, so I can run the file through "sort" without any other processing.  The first line of the reverse-sorted list is the latest file, which "head" pulls out, and the string representing the filename is used as the source argument for the bq load command.  The kind being processed is loaded into a dataset called "automation", in a table matching the name of the kind.


Script:  Clean up in BigQuery


The next step is to clean up the data for export.  The upload tool I used in the previous posts was very particular about formatting, and I pretty much assumed that any command line tool I used would be just as finicky.  I started with the queries I described in my blog on extraction, and ultimately had to make two changes (which are probably related to the failures that led me to convert the file to an Excel spreadsheet before):
  • BigQuery was exporting boolean values as "true" and "false", but I was trying to import these into a "bit" type, which requires 1 and 0 respectively.  A simple CASE statement made the conversion.
  • The SQL tools were very sensitive to timestamp format.  The BQ timestamp includes "UTC" at the end, which was blowing up the type cast.  Another BQ statement would give a human-readable string that was closer to the right format, but the seconds included six decimal places, while SQL Server only supports up to three.  I used SUBSTR to cut off the last three zeros, and then the cast worked fine (a rough sketch of both fixes appears after the script below).
The basic script looks like this (the other queries are super ugly...):

@ECHO OFF
%=----------------------------------=%
%=  PREPARE THE DATA FOR EXPORT     =%
%=  Tables in 'automation' dataset: =%
%=    HtsTicketAssignment           =%
%=    OrgGroup                      =%
%=    Ticket                        =%
%=    UserProfile                   =%
%=----------------------------------=%

call bq --project_id=wyoorgdev rm -f automation.OrgGroup_Export
call bq --project_id=wyoorgdev query --destination_table=automation.OrgGroup_Export "SELECT supervisor, name FROM [automation.OrgGroup]"

The first call to bq deletes the target table.  This proved necessary because running the query with a destination table (the next line) would throw an error if the table already existed.  This may also be possible with the "replace" option, but I didn't try it (didn't notice it was an option till I wrote this up lol).  Running bq query with a destination table specified will store the results in that table, which will facilitate the next step.
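
To give a flavor of the uglier queries, the boolean and timestamp fixes described above would look something like this in legacy BigQuery SQL (the column names here are made up for illustration, and FORMAT_UTC_USEC/TIMESTAMP_TO_USEC is just one way to get the human-readable string mentioned earlier):

call bq --project_id=wyoorgdev rm -f automation.Ticket_Export
call bq --project_id=wyoorgdev query --destination_table=automation.Ticket_Export "SELECT CASE WHEN closed THEN 1 ELSE 0 END AS closed, SUBSTR(FORMAT_UTC_USEC(TIMESTAMP_TO_USEC(created)), 1, 23) AS created FROM [automation.Ticket]"

The CASE expression turns true/false into 1/0 for the bit column, and SUBSTR trims the six-decimal timestamp string down to the three decimal places SQL Server will accept.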


Script: Export to Cloud Storage


Exporting the cleaned up data from BigQuery to Cloud Storage is pretty simple, as it requires just one command.

@ECHO OFF
%=------------------------------------=%
%=  EXPORT DATA TO CLOUD STORAGE      =%
%=  Tables to export                  =%
%=    HtsTicketAssignment_Export      =%
%=    OrgGroup_Export                 =%
%=    Ticket_Export                   =%
%=    UserProfile_Export              =%
%=------------------------------------=%

call bq --project_id=wyoorgdev extract --noprint_header --destination_format=CSV automation.HtsTicketAssignment_Export gs://wyoorgdev_backup_automation/HtsTicketAssignment_Export.csv

The bq extract command does the job.  The "noprint_header" option prevents the export from including the column names as the first row of output.  The table being extracted and the file being extracted to are specified, and that's it.  If the file exists, it is overwritten, which is the desired behavior in this case.


Script: Load into Azure SQL


The most difficult and time consuming part of the process to get right was the load into Azure (I know, you're shocked).  The script for this part looks like this:

@ECHO OFF
%=----------------------------------------------=%
%=  DOWNLOAD CSV FILES                          =%
%=  Bucket                                      =%
%=        wyoorgdev_backup_automation/          =%
%=  Files to download                           =%
%=    HtsTicketAssignment_Export.csv            =%
%=    OrgGroup_Export.csv                       =%
%=    Ticket_Export.csv                         =%
%=    UserProfile_Export.csv                    =%
%=----------------------------------------------=%

call gsutil cp gs://wyoorgdev_backup_automation/HtsTicketAssignment_Export.csv  .
call gsutil cp gs://wyoorgdev_backup_automation/OrgGroup_Export.csv  .
call gsutil cp gs://wyoorgdev_backup_automation/Ticket_Export.csv  .
call gsutil cp gs://wyoorgdev_backup_automation/UserProfile_Export.csv  .

%=----------------------------------------------=%
%=  PUSH TO AZURE SQL                           =%
%=  Server address                              =%
%=        htsdata.database.windows.net          =%
%=----------------------------------------------=%

%=----------------------=%
%=       OrgGroup       =%
%=----------------------=%
sqlcmd -U jenkins@htsdata -S tcp:htsdata.database.windows.net -P %PASSWORD% -d HTSData -Q ^
 "IF (EXISTS (SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'dbo' AND TABLE_NAME = 'OrgGroup')) BEGIN DROP TABLE OrgGroup END CREATE TABLE [dbo].[OrgGroup]( [supervisor] varchar(255) NULL, [name] varchar(255) NULL)"

bcp dbo.OrgGroup in OrgGroup_Export.csv -U jenkins@htsdata -S tcp:htsdata.database.windows.net -P %PASSWORD% -d HTSData -t, -r 0x0A -c -k -e errorOG.txt

The first step is copying the .csv files from Cloud Storage into the local directory.  Once the files are downloaded, two steps are necessary to transfer the data to SQL Server: first the tables must be created with the correct schema using sqlcmd, then the bulk copy tool bcp is used to load the data.  There were plenty of "gotchas" getting this part right:
  • The server specified with the -S option must start with "tcp:" when using SQL Server Native Client 10.0.  Failing to do that made it act like it couldn't find the server.  It seems that with version 11.0, this isn't an issue...
  • The .csv file exported from BigQuery uses Unix style line endings (just LF).  The bcp command, by default, expects Windows style line endings (CRLF).  This can be changed with the -r option, but specifying just LF as the row delimiter was tricky.  The solution I found on a help thread was to use the hex code for the LF character: 0x0A.
  • The -c option on bcp specifies that the data is character data.  It won't work without this one.
  • The -k option tells bcp to keep null values, also necessary for it to work right (probably...)
  • Specifying an error log file with the -e option proved instrumental in figuring out that bcp was choking on boolean and timestamp values.
  • Setting up the account for jenkins on the Azure database took several steps.  I don't know that the login AND user are strictly necessary, but I created both.  I did find that you can't give the account access using GRANT ALL; you have to use the sproc:
    • Create a "contained" user
      CREATE USER jenkins WITH PASSWORD = {random password}
    • Create a login
      CREATE LOGIN jenkins WITH PASSWORD = {same random password}
    • Grant access (via role)
      EXEC sp_addrolemember 'db_owner', 'jenkins'

Launching the scripts from Jenkins


I had my scripts, and they all ran from my laptop.  I saved them to GitHub and called it a night.  The next day, it was time to make the automation happen.  I figured there would be a bit of environmental setup that needed to happen, so I installed an RDP client on my home laptop (I got snowed in, so I was working from home) and connected to my Windows server slave machine.  Since the scripts are batch files and sqlcmd and bcp are Windows utilities, using the Windows machine was necessary.

First order of business was updating the PATH variable to include the git/bin folder so I could access rm and head.  I installed the Google Cloud SDK for gsutil and bq, which also installed Python 2.7.  Following installation, I went through the authentication process so that the machine could access the necessary Google resources.  Finally, I installed the Microsoft Command Line Utilities for SQL Server for sqlcmd and bcp.  
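
The authentication step is just the interactive login the SDK walks you through, something along the lines of:

gcloud auth login

which opens a browser to grant access and stores the credentials locally so that subsequent gsutil and bq calls on that machine are authorized.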

I was initially frustrated by several failed builds that acted as though the utilities and PATH changes I'd made to the slave machine weren't registering, even though I'd restarted the slave.  Turns out you also have to restart Jenkins.  Once that was finished, the jobs worked as expected.  Each script is run as a separate job (4 total), and the jobs are chained together using a Post-Build task.
