Sunday, October 29, 2017

Lambda not supporting NLTK file size

I am writing a Python script that analyzes a piece of text and returns the results in JSON format. I am using NLTK to analyze the data. This is my flow:

Create an endpoint (API Gateway) -> the endpoint calls my Lambda function -> the function returns the required data as JSON.

I wrote my script and deployed it to Lambda, but I ran into this issue:

    Resource punkt not found. Please use the NLTK Downloader to obtain the resource:

    >>> import nltk
    >>> nltk.download('punkt')

    Searched in:
      - '/home/sbx_user1058/nltk_data'
      - '/usr/share/nltk_data'
      - '/usr/local/share/nltk_data'
      - '/usr/lib/nltk_data'
      - '/usr/local/lib/nltk_data'
      - '/var/lang/nltk_data'
      - '/var/lang/lib/nltk_data'

Even after downloading 'punkt', my script still gave me the same error. I tried the solutions here:

Optimizing python script extracting and processing large data files

but the problem is that the nltk_data folder is huge, while Lambda has a size restriction.

How can I fix this issue? Or where else can I run my script and still integrate the API call?

I am using Serverless to deploy my Python scripts.

1 Answer

There are two things that you can do:

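  1. The error suggests that NLTK is not searching the directory where the data actually lives. Tell NLTK where to look, either through an environment variable or by adding the directory to NLTK's own search path (nltk.data.path, not sys.path, which only affects module imports):

    import nltk
    nltk.data.path.append('/var/task/nltk_data/')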

or do it this way:

  1. Run nltk.download(), then copy the downloaded data into the root folder of your AWS Lambda application. The directory must be named "nltk_data".

  2. In the Lambda function dashboard (in the AWS console), add an environment variable with the key NLTK_DATA and the value ./nltk_data.
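As a rough sketch of the environment-variable approach, the snippet below points NLTK at a bundled nltk_data directory from code instead of the console. It assumes the directory sits at the root of the deployment package. The helper name configure_nltk_data is hypothetical; LAMBDA_TASK_ROOT is the variable Lambda sets to the function's code directory, and the sketch falls back to the current working directory when it is absent.

```python
import os


def configure_nltk_data(task_root=None):
    """Prepend <task_root>/nltk_data to the NLTK_DATA environment variable.

    Hypothetical helper: assumes nltk_data was copied into the root of the
    deployment package, as described in the steps above.
    """
    root = task_root or os.environ.get("LAMBDA_TASK_ROOT", os.getcwd())
    data_dir = os.path.join(root, "nltk_data")

    # NLTK_DATA may already hold several directories separated by os.pathsep.
    existing = [p for p in os.environ.get("NLTK_DATA", "").split(os.pathsep) if p]
    if data_dir not in existing:
        os.environ["NLTK_DATA"] = os.pathsep.join([data_dir] + existing)
    return data_dir
```

Call configure_nltk_data() at the top of the handler module, before import nltk, because NLTK reads NLTK_DATA when it is first imported.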


  2. Reduce the size of the NLTK download, since you won't need all of it.

    1. Delete all the zip files and keep only the resources you need. For example, if you only need stopwords, keep nltk_data/corpora/stopwords and delete the rest.

    2. If you need the tokenizer, keep nltk_data/tokenizers/punkt. Most resources can be downloaded separately, for example with python -m nltk.downloader punkt, and the files then copied over.
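The trimming steps above could be scripted roughly as follows. Instead of deleting files in place, this sketch copies only the needed resource directories into the deployment package, which leaves the zip files and unused corpora behind. The function names and the default keep list are hypothetical, and dirs_exist_ok requires Python 3.8 or later.

```python
import os
import shutil


def copy_needed_resources(src_root, dst_root, keep=("tokenizers/punkt",)):
    """Copy only the listed unpacked resources from src_root to dst_root.

    `keep` holds paths relative to the nltk_data root, for example
    "tokenizers/punkt" or "corpora/stopwords". Zip archives are never copied
    because only the unpacked directories are selected.
    """
    copied = []
    for rel in keep:
        parts = rel.split("/")
        src = os.path.join(src_root, *parts)
        dst = os.path.join(dst_root, *parts)
        if not os.path.isdir(src):
            raise FileNotFoundError(f"resource not unpacked at {src}")
        shutil.copytree(src, dst, dirs_exist_ok=True)
        copied.append(dst)
    return copied


def dir_size_bytes(root):
    """Total size of all files under root, to compare against Lambda's limit."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total
```

For example, copy_needed_resources(os.path.expanduser("~/nltk_data"), "nltk_data") would build a trimmed nltk_data folder in the project root, assuming the download went to the home directory, and dir_size_bytes("nltk_data") reports how much it adds to the package.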
