I am writing a Python script that analyses a piece of text and returns the data in JSON format. I am using NLTK to analyze the text. Basically, this is my flow:
Create an endpoint (API Gateway) -> calls my Lambda function -> returns JSON of required data.
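For context, my handler looks roughly like this (simplified; the exact NLTK calls are illustrative, but the tokenization step is what needs 'punkt'):

    # handler.py - simplified sketch; names and exact NLTK calls are illustrative
    import json
    from nltk.tokenize import word_tokenize

    def handler(event, context):
        # API Gateway (proxy integration) passes the request body as a JSON string
        body = json.loads(event.get('body') or '{}')
        tokens = word_tokenize(body.get('text', ''))  # this call needs the 'punkt' data
        return {
            'statusCode': 200,
            'body': json.dumps({'tokens': tokens}),
        }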
I wrote my script and deployed it to Lambda, but I ran into this issue:
Resource punkt not found. Please use the NLTK Downloader to obtain the resource:

    >>> import nltk
    >>> nltk.download('punkt')

Searched in:
    - '/home/sbx_user1058/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/var/lang/nltk_data'
    - '/var/lang/lib/nltk_data'
Even after downloading 'punkt', my script still gave me the same error. I tried the solutions here:
Optimizing python script extracting and processing large data files
but the issue is that the nltk_data folder is huge, while Lambda has a size restriction.
How can I fix this issue? Or where else can I run my script and still integrate the API call?
I am using Serverless to deploy my Python scripts.
1 Answer
There are two things that you can do:
The error suggests that the data path is not being defined properly; you can append it in code (or set it as an environment variable):
    import nltk
    # NLTK looks for data on nltk.data.path (not sys.path), so point it at
    # the directory bundled with the Lambda package:
    nltk.data.path.append('/var/task/nltk_data/')
Or this way: once you run nltk.download() locally, copy the downloaded data into the root folder of your AWS Lambda application (name the directory nltk_data). Then, in the Lambda function dashboard in the AWS console, add an environment variable with the key NLTK_DATA and the value ./nltk_data.
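Either way, you can sanity-check that NLTK actually sees the bundled data. A minimal sketch (assuming the data sits in ./nltk_data as above; NLTK reads the NLTK_DATA variable when it is imported and adds it to its search path):

    import os

    # Fallback for local testing; on Lambda the console-defined variable wins
    os.environ.setdefault('NLTK_DATA', './nltk_data')

    import nltk
    from nltk.tokenize import word_tokenize

    print(nltk.data.path)  # './nltk_data' should appear in this list
    print(word_tokenize("Quick check that punkt loads from the bundled data."))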
Reduce the size of the NLTK downloads, since you won't be needing all of them.
Delete all the zip files and keep only the sections you need. For example, stopwords can be kept in nltk_data/corpora/stopwords; delete the rest. If you need tokenizers, save them to nltk_data/tokenizers/punkt. Most of these can be downloaded separately, e.g. python -m nltk.downloader punkt, then copy over the files.
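The per-resource download can also be scripted, so only the data you need lands in the directory you package. A minimal sketch (the resource names are examples; adjust to whatever your script uses):

    # fetch_nltk_data.py - run locally before deploying
    import nltk

    # Download only the required resources into ./nltk_data, which is then
    # bundled with the Lambda package.
    for resource in ('punkt', 'stopwords'):
        nltk.download(resource, download_dir='./nltk_data')

The downloader leaves the .zip archives next to the extracted folders, so delete those before packaging to keep the bundle small.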