Asterisk leaking resources? No, it's gsutil :'(
Dec 5, 2018 10:00 · 873 words · 5 minute read
This week we solved a resource leak in our asterisk images reserved for call handling. Here’s the path we followed to mitigate the issue, which turned out to be related to the gsutil command line tool.
We recently switched our infrastructure and applications to Google Kubernetes Engine, including our telephony stack.
One of the main problems of the migration was the storage volumes provided by GCP. In the old architecture, running Kubernetes on top of Rancher, we used an NFS server to provide storage volumes for our pods, mounted as ReadWriteMany. This worked well for the asterisk call recordings because we mounted the same volume in all asterisk pods, and the NFS server handled the writes.
When we moved to GKE, the available volumes are raw disks, which can only be mounted for writing on a single node. The solution we came up with was to use Google Cloud Storage buckets.
1. Uploading files with gsutil in production: a big mistake
Btw I don’t know why I thought this would be a good choice…
The initial idea was to code a simple tool to upload files to the bucket. Then I saw the gsutil command line tool, and it looked perfect for the job.
We already had some routines handling the call recordings in an asterisk stasis application, basically audio conversion using sox. The only changes we made were to install the Google Cloud SDK (which already contains the gsutil tool) in the image, along with a valid service account with storage write permission at runtime.
This seemed great, and in fact it worked well, until we saw the resource usage in production… The pods constantly went to an Evicted state, basically exhausting the node resources until the kubelet decided to remove the offending pod, which happened a lot, roughly every half hour. Since the instances hold the call sessions, every time a pod exits the ongoing calls are dropped for the clients :'(.
Here’s the load of the running pods. You can see that each one is using about 3 vCPUs. The RAM usage seems lower, but most of the evictions are actually due to near-OOM memory pressure.
$ kubectl top pods | grep asterisk-calls
mask-asterisk-calls-655c957dcb-5d8vm 3924m 305Mi
mask-asterisk-calls-655c957dcb-cn5ss 3194m 192Mi
mask-asterisk-calls-655c957dcb-4cdq2 3256m 337Mi
mask-asterisk-calls-655c957dcb-gskj6 3717m 619Mi
mask-asterisk-calls-655c957dcb-gwsdx 3936m 715Mi
mask-asterisk-calls-655c957dcb-n6gt7 3890m 256Mi
mask-asterisk-calls-655c957dcb-pkgql 3950m 239Mi
Initially I thought it was an asterisk leak of some sort, but after exec(ing) into some of the pods to inspect what was happening, we found a lot of python processes still running, waiting for an event to complete, all of them created by gsutil. So the resource leak was indeed related to the gsutil tool, not asterisk.
2. Getting rid of gsutil
We already had a Go microservice responsible for serving the call recordings from the bucket, so I suggested moving the “upload to bucket” role to it, removing the need for gsutil and the Cloud SDK.
Using the cloud.google.com/go/storage package, it acts as a front layer for the storage, providing a local filesystem implementation (for local testing) and a Cloud Storage implementation (for production).
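To make that concrete, here’s a minimal sketch of what such a front layer can look like. The type and constructor names here are illustrative, not the actual production code; only the storage.Storage type and its Put method are implied by the handler shown further down:

// Sketch of the storage front layer; names are illustrative.
package storage

import (
    "context"
    "io"

    gcs "cloud.google.com/go/storage"
)

// Storage abstracts where recordings end up: a local directory while
// developing, a Cloud Storage bucket in production.
type Storage interface {
    Put(name string) (io.WriteCloser, error)
}

// CloudStorage streams objects into a GCS bucket.
type CloudStorage struct {
    client *gcs.Client
    bucket string
}

func NewCloudStorage(ctx context.Context, bucket string) (*CloudStorage, error) {
    // Credentials come from the service account available to the pod.
    client, err := gcs.NewClient(ctx)
    if err != nil {
        return nil, err
    }
    return &CloudStorage{client: client, bucket: bucket}, nil
}

// Put returns a writer that streams straight to the object; the upload
// is only committed when the caller closes the writer.
func (c *CloudStorage) Put(name string) (io.WriteCloser, error) {
    return c.client.Bucket(c.bucket).Object(name).NewWriter(context.Background()), nil
}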
We use this simple script to upload the recordings using curl:
#!/bin/bash
RECORDINGS_DIR=/var/spool/asterisk/monitor/

if [ "$#" -ne 1 ]
then
    echo "Usage: upload-recording.sh ${RECORDINGS_DIR}path/to/file.gsm"
    exit 1
fi

RECORDING_PATH="$1"
RECORDING_RELATIVE=${RECORDING_PATH#"${RECORDINGS_DIR}"}

if [ -e "$RECORDING_PATH" ]; then
    # -f makes curl fail on HTTP errors, so we only delete the recording
    # after a successful upload.
    curl -f -XPOST \
        -F "filename=${RECORDING_RELATIVE}" \
        -F "file=@${RECORDING_PATH}" "${RECORDINGS_SERVER}/create" && \
        rm "${RECORDING_PATH}"
fi
The recordings microservice sits behind a Kubernetes Service (internal load balancer), and with graceful shutdown in the HTTP server it works well, with no downtime during rollouts.
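The graceful shutdown itself is mostly standard library plumbing. Here’s a rough sketch of the wiring, assuming a /create route, port 8080 and a 30-second grace period (all illustrative values, not the exact production setup):

package main

import (
    "context"
    "log"
    "net/http"
    "os/signal"
    "syscall"
    "time"
)

// serve wires a handler (e.g. the createHandler shown below) into an HTTP
// server that shuts down gracefully when Kubernetes sends SIGTERM.
func serve(create http.Handler) {
    ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
    defer stop()

    mux := http.NewServeMux()
    mux.Handle("/create", create)

    srv := &http.Server{Addr: ":8080", Handler: mux}
    go func() {
        if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
            log.Fatalf("listen: %v", err)
        }
    }()

    // Block until the pod is asked to terminate (e.g. during a rolling update).
    <-ctx.Done()

    // Give in-flight uploads a grace period to finish before exiting,
    // so a rollout never drops a recording mid-upload.
    shutdownCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()
    if err := srv.Shutdown(shutdownCtx); err != nil {
        log.Printf("shutdown: %v", err)
    }
}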
Here’s the most critical part of the handler function that saves the file. As you can see, we use io.Copy to stream the file contents directly to the underlying storage implementation. io.Reader and io.Writer are key interfaces in Go, used pretty much everywhere in the stdlib, from the http to the crypto packages:
func createHandler(sto storage.Storage) http.Handler {
    return http.HandlerFunc(func(rw http.ResponseWriter, r *http.Request) {
        // Keep up to 32 MiB of the multipart form in memory; larger parts spill to disk.
        err := r.ParseMultipartForm(32 << 20)
        if err != nil {
            log.Errorln(err)
            rw.WriteHeader(http.StatusInternalServerError)
            return
        }
        filename := r.FormValue("filename")
        if filename == "" {
            log.Errorln("Filename cannot be empty!")
            rw.WriteHeader(http.StatusInternalServerError)
            return
        }
        file, _, err := r.FormFile("file")
        if err != nil {
            log.Errorln(err)
            rw.WriteHeader(http.StatusInternalServerError)
            return
        }
        defer file.Close()
        writer, err := sto.Put(filename)
        if err != nil {
            log.Errorf("Failed to open storage writer: %s\n", err)
            rw.WriteHeader(http.StatusInternalServerError)
            return
        }
        // Stream the uploaded part straight into the storage backend.
        if _, err = io.Copy(writer, file); err != nil {
            log.Errorf("Failed to copy file contents from request: %s\n", err)
            rw.WriteHeader(http.StatusInternalServerError)
            return
        }
        // Closing the writer commits the object in the storage backend.
        if err = writer.Close(); err != nil {
            log.Errorf("Error while closing writer: %s\n", err)
            rw.WriteHeader(http.StatusInternalServerError)
            return
        }
        log.Infof("Saved file %s\n", filename)
    })
}
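For completeness, the local filesystem implementation mentioned earlier (the one we use for local testing) can be almost trivial. Again, a sketch with illustrative names rather than the actual code:

package storage

import (
    "io"
    "os"
    "path/filepath"
)

// LocalStorage writes recordings under a root directory on the local
// filesystem; handy for development without GCS credentials.
type LocalStorage struct {
    Root string
}

func (l *LocalStorage) Put(name string) (io.WriteCloser, error) {
    // A real implementation should also reject names escaping Root (".." traversal).
    path := filepath.Join(l.Root, filepath.Clean(name))
    if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
        return nil, err
    }
    return os.Create(path) // *os.File satisfies io.WriteCloser
}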
Here’s a canary deploy we made to test out the changes; you can see how big the resource leak generated by gsutil really was:
$ kubectl top pods | grep asterisk-calls
mask-asterisk-calls-655c957dcb-5d8vm 3924m 305Mi
mask-asterisk-calls-655c957dcb-cn5ss 3194m 192Mi
mask-asterisk-calls-655c957dcb-gqlvd 347m 92Mi <- canary deploy here
mask-asterisk-calls-655c957dcb-gskj6 3717m 619Mi
mask-asterisk-calls-655c957dcb-gwsdx 3936m 715Mi
mask-asterisk-calls-655c957dcb-n6gt7 3890m 256Mi
mask-asterisk-calls-655c957dcb-pkgql 3950m 239Mi
It’s about 8x less resource usage!
And here is the full production rollout:
$ kubectl top pods | grep 'asterisk-calls\|recordings-server'
mask-asterisk-calls-56c64d497f-2spsf 230m 72Mi
mask-asterisk-calls-56c64d497f-9fh6f 215m 73Mi
mask-asterisk-calls-56c64d497f-fhzxg 340m 75Mi
mask-asterisk-calls-56c64d497f-hb764 374m 80Mi
mask-asterisk-calls-56c64d497f-pdvg5 309m 87Mi
mask-asterisk-calls-56c64d497f-r5gxw 378m 94Mi
mask-asterisk-calls-56c64d497f-z96vm 307m 103Mi
mask-recordings-server-5c74d6b859-jvbzr 19m 69Mi
mask-recordings-server-5c74d6b859-jznsm 20m 69Mi
mask-recordings-server-5c74d6b859-t5bfc 19m 76Mi
mask-recordings-server-5c74d6b859-wbqxs 21m 69Mi
The effect was so big that our cluster size decreased from 8 to 4 nodes. That’s insane. 🤯