2016/07/31

Testing Out Google's Natural Language API

Since today is the last day of the Google Natural Language API's free public beta, I thought I'd give it a little spin. One of the applications listed on Google's promotional page was analyzing product reviews, which reminded me that late last year I built a Slack webhook that extracts Google Play and App Store reviews, and I still have a stockpile of those lying around in a database of mine. What better sample data to use for this experiment?

Apple and Google provide rating data (1 to 5 stars), so I already know what percentage of users gave unfavourable reviews, but it would be good to narrow down what exactly those users are complaining about. Perhaps that's something we can tackle with this API? Let's try it.

My sample data is stored in a database with the following structure:


sqlite> .schema
CREATE TABLE review (
  id INT(11) NOT NULL,
  title VARCHAR(255) NULL,
  content TEXT,
  rating TINYINT(3) NOT NULL DEFAULT 0,
  device_type TINYINT(3) NOT NULL,
  device_name VARCHAR(255) NULL,
  author_name VARCHAR(255) NULL,
  author_uri VARCHAR(255) NULL,
  created DATETIME NULL,
  updated DATETIME NOT NULL,
  acquired DATETIME NOT NULL,
  PRIMARY KEY (id, device_type)
);
CREATE INDEX updated_idx on review(updated);
CREATE INDEX rating_idx on review(rating);

At this point I don't really care about platform (although I could certainly break the samples down further by device type if I were so inclined), so I'm just going to collect the comments from users who gave a distinctly bad rating (1 or 2 stars) to feed to the API, using a simple query like this:


SELECT content FROM review WHERE rating < 3;

If I throw the main body of each review at Google's API and it comes back with some salient keywords (and how often they're brought up), that should give us a better clue about what users are complaining about.

So first we need to authenticate with the API.

Creating a service key file for authentication is straightforward enough, so I'll just link to the documentation here. Once you have one of those, all you need to do is use the gcloud command to authenticate and print an access token.


$ gcloud auth activate-service-account --key-file=kinmedai-cb03d32572c2.json
Activated service account credentials for: [user@projectname.iam.gserviceaccount.com]
$ gcloud auth print-access-token
[[output omitted]]

Now that I'm ready to access the API, I decided to write a Go script to do the dirty work for me. The first step is to use the sample JSON payload and response body data from the getting started docs to generate Go structs. Writing structs by hand is a pain, so I used JSON to Go to generate the bulk of it and then tweaked it a bit, like so:

The request structure:


type EntityRequest struct {
  EncodingType string                `json:"encodingType"`
  Document     EntityRequestDocument `json:"document"`
}

type EntityRequestDocument struct {
  TypeName string `json:"type"`
  Content  string `json:"content"`
  Language string `json:"language"`
}
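
Marshaling one of these should produce the same payload shape as the sample in the getting started docs. A quick sanity check (the review text here is just a stand-in):


jsonEntity, _ := json.MarshalIndent(&EntityRequest{
  EncodingType: "UTF8",
  Document: EntityRequestDocument{
    TypeName: "PLAIN_TEXT",
    Content:  "ガチャ渋すぎ", // "the gacha is way too stingy"
    Language: "JA",
  },
}, "", "  ")
fmt.Println(string(jsonEntity))
// {
//   "encodingType": "UTF8",
//   "document": {
//     "type": "PLAIN_TEXT",
//     "content": "ガチャ渋すぎ",
//     "language": "JA"
//   }
// }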

And the response structure:


type EntityResponse struct {
  Entities []DetectedEntity `json:"entities"`
  Language string           `json:"language"`
}

type DetectedEntity struct {
  Name       string          `json:"name"`
  EntityType string          `json:"type"`
  Salience   float64         `json:"salience"`
  Mentions   []EntityMention `json:"mentions"`
  Metadata   struct {
    WikipediaUrl string `json:"wikipedia_url"`
  } `json:"metadata"`
}

type EntityMention struct {
  Text struct {
    Content     string `json:"content"`
    BeginOffset int    `json:"beginOffset"`
  } `json:"text"`
}
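
And going the other way, unmarshaling a response shaped like the one in the docs is a quick way to confirm the field tags line up (the values here are made up):


sample := []byte(`{
  "entities": [{
    "name": "ガチャ",
    "type": "OTHER",
    "salience": 0.42,
    "mentions": [{"text": {"content": "ガチャ", "beginOffset": 3}}],
    "metadata": {}
  }],
  "language": "ja"
}`)

var entityResponse EntityResponse
if err := json.Unmarshal(sample, &entityResponse); err != nil {
  log.Fatal(err)
}
fmt.Println(entityResponse.Entities[0].Name) // ガチャ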

Now that I know what kind of data I'll be dealing with I can start building my request.


func createEntityRequests() []*EntityRequest {
  dbh := getDBH()
  rows, err := dbh.Query(`SELECT content FROM review WHERE rating < 3`)
  if err != nil {
    log.Fatal(err)
  }
  defer rows.Close()

  var entityRequests []*EntityRequest

  for rows.Next() {
    var comment string
    err = rows.Scan(&comment)
    if err != nil {
      log.Fatal(err)
    }

    // Google Play lets users submit ratings without comments (stars-only), so skip those
    if len(comment) == 0 {
      continue
    }

    entityRequest := &EntityRequest{
      EncodingType: "UTF8",
      Document: EntityRequestDocument{
        TypeName: "PLAIN_TEXT",
        Content:  comment,
        Language: "JA",
      },
    }
    entityRequests = append(entityRequests, entityRequest)
  }

  if err := rows.Err(); err != nil {
    log.Fatal(err)
  }

  return entityRequests
}
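
getDBH is standard database/sql boilerplate; the real thing is in the gist linked at the end, but a minimal sketch (assuming the mattn/go-sqlite3 driver and the -d flag you'll see in a moment) looks something like this:


// Assumes the sqlite3 driver is registered via a blank import:
//   _ "github.com/mattn/go-sqlite3"
var dbPath = flag.String("d", "", "path to the sqlite3 review database")

func getDBH() *sql.DB {
  // sql.Open only validates its arguments; the file is opened lazily on first use.
  dbh, err := sql.Open("sqlite3", *dbPath)
  if err != nil {
    log.Fatal(err)
  }
  return dbh
}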

Next I'll need a function that posts to the entity analysis API. Again, the quickstart docs summarize this process very clearly: it's a basic HTTP POST with a JSON payload and the access token we got from gcloud set in the Authorization header.

I'm planning to pass the token in via standard input so I can pipe gcloud's output into my script, but more on that later. First, the request:


func postEntity(accessToken string, entityRequest *EntityRequest) []byte {
  jsonEntity, err := json.Marshal(entityRequest)
  if err != nil {
    log.Fatal(err)
  }
  req, err := http.NewRequest("POST", ENTITIES_URL, bytes.NewBuffer(jsonEntity))
  if err != nil {
    log.Fatal(err)
  }
  req.Header.Set("Content-Type", "application/json")
  req.Header.Set("Authorization", "Bearer "+accessToken)

  client := &http.Client{}
  res, err := client.Do(req)
  if err != nil {
    log.Fatal(err)
  }
  defer res.Body.Close()

  body, err := ioutil.ReadAll(res.Body)
  if err != nil {
    log.Fatal(err)
  }
  if res.StatusCode != http.StatusOK {
    log.Fatal(res.Status)
  }

  return body
}
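
By the way, ENTITIES_URL is just a package-level constant. At the time of writing, the entity analysis endpoint from the docs looks like the one below, but double-check the current documentation since it may well change once the API is out of beta:


// v1beta1 endpoint from the getting started docs (may change after the beta).
const ENTITIES_URL = "https://language.googleapis.com/v1beta1/documents:analyzeEntities"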

Now, you might have noticed that this is not the most efficient way to get feedback, since I'm sending one request per review. At the moment I only have 220 reviews in my test data, so it's not a big deal, but if I were actually planning to run this on any regular basis it could get expensive (and slow). Since we don't need to associate any of the other data with the content, we could amalgamate several reviews into one body of text and send the data in batches.
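
The amalgamation itself would be simple enough; something like this rough sketch (which assumes the strings package is imported and ignores the API's document size limits, which I haven't checked):


// batchEntityRequests joins several reviews into a single document per
// request. Entity counts become per-batch instead of per-review, which is
// fine here since I only care about totals.
func batchEntityRequests(comments []string, batchSize int) []*EntityRequest {
  var requests []*EntityRequest
  for start := 0; start < len(comments); start += batchSize {
    end := start + batchSize
    if end > len(comments) {
      end = len(comments)
    }
    requests = append(requests, &EntityRequest{
      EncodingType: "UTF8",
      Document: EntityRequestDocument{
        TypeName: "PLAIN_TEXT",
        Content:  strings.Join(comments[start:end], "\n"),
        Language: "JA",
      },
    })
  }
  return requests
}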

However, at this point I'm not even 100% sure this experiment will produce any kind of meaningful result, so for the time being I'm going to analyze each review individually. Better to make sure it works before spending time optimizing it, right?

Anyway, let's piece the rest of the script together. I know I'm going to have to pass the script my API access token as well as the path to the db (I'm using sqlite3), so my interface is going to look something like this:


$ gcloud auth print-access-token | go run kinmedai.go -d /path/to/sqlitedbname.db

So in my main block I'm going to grab those two parameters via stdin and flags and build the request payloads. Then I'll post each payload to the API and parse the entities from each HTTP response body into the detectedEntities map, which counts how many times a specific term was referenced:


func main() {
  flag.Parse()

  var accessToken string
  if _, err := fmt.Scan(&accessToken); err != nil {
    log.Fatal(err)
  }

  entityRequests := createEntityRequests()
  detectedEntities := make(map[string]int)

  for _, entityRequest := range entityRequests {
    var entityResponse EntityResponse
    body := postEntity(accessToken, entityRequest)
    if err := json.Unmarshal(body, &entityResponse); err != nil {
      log.Fatal(err)
    }

    for _, entity := range entityResponse.Entities {
      detectedEntities[entity.Name]++
    }
  }

  for k, v := range detectedEntities {
    fmt.Printf("%s: %d\n", k, v)
  }
}

Yikes! That's a lot of for loops! But as I said I'm still in the experimental phase so I'm not going to worry about how fast this runs just yet.
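
One aside before we look at the results: Go deliberately randomizes map iteration order, so the counts print out unordered. If I wanted them sorted by frequency I could pull the names out and sort those first, along these lines (a sketch, assuming the sort package is imported):


// byCount sorts entity names by descending count.
type byCount struct {
  names  []string
  counts map[string]int
}

func (b byCount) Len() int           { return len(b.names) }
func (b byCount) Swap(i, j int)      { b.names[i], b.names[j] = b.names[j], b.names[i] }
func (b byCount) Less(i, j int) bool { return b.counts[b.names[i]] > b.counts[b.names[j]] }

// printSorted could replace the final loop in main.
func printSorted(counts map[string]int) {
  var names []string
  for name := range counts {
    names = append(names, name)
  }
  sort.Sort(byCount{names, counts})
  for _, name := range names {
    fmt.Printf("%s: %d\n", name, counts[name])
  }
}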

And the output looks something like this (did I mention I was parsing Japanese reviews?):


Wi-Fi: 2
GOOGLE: 1
TT: 1
DL: 1
2MWXZH4T: 1
アンインストール: 2
ぼく友: 1
ガラポン: 1
2.7.2: 1
GYH7AFSY: 1
某都: 1
Fuck this shit I: 1
掛布: 1
星飛雄馬: 1
ガチャ: 12
ガチャゲー: 1
C8YENNZG: 1
やめた: 3
ゴミゲー: 1
めちゃ運: 1
っ・ω: 1
ゴールド: 1
ガチャ.: 1
間違いない(* ̄ー ̄: 1
WB5Z2JQX: 1
甲子園: 3
平安: 1
パワプロ: 1

So the results look a bit hit and miss. Things like invitation codes, emoji, etc. that probably shouldn't be treated as keywords are showing up as entities. However, at least 15 out of our sample of 220 users (counting "ガチャ", "ガチャ.", "ガチャゲー" and "ガラポン" as the same result) are complaining about the gacha system in this particular app.

I can't exactly say this is a groundbreaking discovery or anything. Having read most of the reviews for the app over the last few months (my webhook delivers new reviews to my team's Slack daily), I pretty much knew that many of the complaints hinged on users not getting the drops they wanted, but I guess this helps quantify it a little better?

Either way, it looks like Google's entity analysis is mostly geared toward pinpointing terms that can be found on Wikipedia, whereas for an app like the one I run it would be more useful to search for game-specific terminology if the goal is to pinpoint features users might be frustrated about.

But now that I've seen what this API is capable of I'm already starting to think of better applications for this technology (like analyzing followers' tweets to see what other games and manga they are talking about).

I skipped over some details (such as connecting to the db and other things that don't really pertain to using the API), but I've uploaded the script to a gist, so feel free to reference it in its entirety in case I lost you at any point.