ETL is a tool that allows the admin to perform bulk or time consuming operations that may affect multiple records in one go.
There are unlimited examples of this, but for an introduction the following to name a few:
Bulk import Data from Excel : The platform can directly import data from Excel files, in a format the business feels comfortable with. The platform does not force any type of format onto the business and can be customized to accept any column field mapping structure. However once a format is decided, then the admin is expected to provide data in the same format each time.
Image resize for multiple products : Websites demand optimized images for various pages. It is manually very difficult to sit an optimize each image. This entire process can be automated.
Automated image association using File Name to the product : For example we can have a scheme that relates the SKU Code as a file name to the product. The shcemes can get very complex. Another example maybe, where the same model product is available in different sizes but the image is the same for them all.
Reload Shipping rules/logic on the fly without restarting the system
The ETL Flow
While the ETL processes can be many and vary from project to project. They all follow the same flow.
Select the Reader :
Each type of ETL process has a READER associated with it. For example Excel Reader or an Image Reader etc. The admin should pick the ETL Process for the job they want to conduct.
Select the Mode :
Depending on the type of job/reader, the admin may want to perform one of the operations mentioned below.
INSERT : Used when you want to only add new Products or Data into the system. If the data contains existing items, it will compain but ignore those records. So if you have a Excel with old and new data, and are worried that the OLD data should be ignored then INSERT is the mode to choose.
UPDATE : Will only consider existing records from the input data and will ignore any new records the process encounters.
INSERT + UPDATE : When not sure and your data is a monolithic file of old and new records, the suggested order is to do an ETL first on INSERT mode and then in UPDATE mode.
DELETE : To clean/delete multiple specific records in Bulk. Note: When you delete a product it only deletes the product from the database, it does not delete its images. That is not necessary, however over time one may want to clean it t osave disk space and improve performance using FTP directly.
Conditional Modes : On selecting a reader, the ETL admin may further influence what type of operations are available. For example, in an Image resize option there is no requirement for any Mode to be specified and hence you will not be give any choice to select any Mode.
Choose to verify first : If there is any input data like in the case of Excel file, one may want to check the file for any faults in its format before proceeding with the actual update. If the Checkbox is selected, it will only verify the file but not commit any of its data to the system. For any Reader if you want the system to make updates please ensure the verify option is NOT SELECTED, once yo uare confident that the data is ready.
Upload File / Provide additional parameters : Depending on the nature of the Reader you may have to choose a File to upload or pass additional data. This depends on the Reader selected.
Process : Once all the other options have been selected, simply press the Process button. This will start a progress bar from 0% to 100%. On clicking please note the Process ID and date/time mentioned. It looks like this: Run @ 05-Jul-2015 11:02:31 PM with Id 626957. If there is a problem support can identify the process if you communicate the same to them. You should also get an email about the same details that the process has started on your registered admin account email address (can check on Systems or Home page what the email ID is).
Post Process :
PASS : If the process goes to 100% without any error, you will also get an email stating the process as successful. The subject of the email will mention the word ETL or process or both
FAIL : if there is a failure in the process, it will show the reasons for failure. You can try to fix it or report it to the technical person incase you are not able to understand or fix the issue yourself. Be sure to communicate the Process Id or forward the email that contains these details.
Hang State : Sometimes a process may seem its stuck or taking a long time. If you have got an email then it means it just the user interface that has got disconnected. If not, treat it as a FAIL case and follow the advice in FAIL.
The excel reader is used to import data in bulk from an Excel file from your local computer. You can choose any of the modes mentioned.
Please also note, that the excel reader can be used for other purposes (if they apply to your application); like loading COD pin-code and cash limit data etc and not just products.
If the reader encounters an error, it will either be a WARNING or a CRITICAL ERROR. A warning will indicate something wrong in a particular row but it will carry on. For example, a column where Price was to be a number but instead you mentioned the SKU code text.
Also note the numbering starts from 0. So if it complains there is a probblem on row 5 column 6, in human terms thats the 6th row and 7th column.
Image Mapper & Reader
Map images to their entities
Once the images are ready, the system copies the optimized versions to various folders that the website recognizes/accesses. Now, we are ready to map the image(s) to their respective products or entities.
For this you must ensure the Image file name matches the convention given for your application. If the system finds any images that do not map to any product, it will warn you but that is not considered as a critical problem and hence it can continue to map the other images.
You must ensure though that all the products are updated in the System before you run this. This is because the ETL process needs to know the SKU or the product identifier to map it to the image. If you have not INSERTED the record or product before, then the system has no way of knowing the image belongs to which process.
This type of reader, goes through all the images in the FTP upload folder and starts compressing them one by one if they are not already compressed. For this reason this reader may take time to complete. Of all the readers, this is the slowest.
Clean Redundant images & Unassociate Readers
Clean and remove any existing mappings between images and their products or entities
This is optional, and only removes any product associated with their images. This is useful if you want to simply clean or remove any product images from existing products.
Clean does not always imply the file or image itself will be deleted, it merely means that the association between the images and their entities are broken. By convention "Clean Mapping" or "Unassociate" means to simply break the relation, while "Clean Redundant" means the reader actually will delete the file. You can check the Reader description or check with your administrator to know which does what.
Clean Dirty images
Images and files may be copied and optimized, leaving them in multiple folders. A Dirty image, is one that is left over in an intermediate folder but no longer exists in the source folder. Cleaning dirty images hence implies, this action will clear the derivate folder completely, so that all images in source are the ones to be considered for the target.
Lets assume the following structure exists:
An UPLOADS folder (For User to organize their images the way they like) : site-content/catalog/original/uploads, A STAGING FOLDER (For the System to flatten any user defined hierarchy and remove duplicate files with same names across user folders) : site-content/catalog/original/all A TARGET folder (with final optimized images for the site)
Images are taken from the UPLOADS folder and flattened into the STAGING folder. Any optimization process then takes images from the STAGING folder into TARGET folder. This gives 2 benifits. (1) If the UPLOADS folder is cleared, the STAGING retains old images. (2) Weed out duplicates. Sometimes, however the STAGING folders may cause an old image with the same name to persist. Using this action will clear it.
Empty the entire catalog data, except categories. All products will be deleted, leaving only the caqtegory skeleton.
Avoid this, unless you really know what you are doing!
There can be more types of readers/processes. Those will be for more advanced cases and be available in the documentation of various other modules that provide the reader for bulk operatiosn concerning them. Example: Content CDN, Scripting etc.
Catalog update steps
A typical Bulk update on your catalog based site will include the following steps:
Note on Post ETL issues : There can be issues related to you not being able to see changes post ETL process. Like you could not see the updates reflected on the site after making them or images are not showing up etc. Those issues are not really part of the ETL process and you can read more about them here.
While processing any tabular data like Excel sheet or file; the ETL process will clearly list any errors in the process.
You can always check the ROW and COLUMN index to know where the error came from. This shoudl clue you to the origin of the problem if not the reason.
ROW and COLUMN index numbering starts from 0 (zero). So [3,4] actually means the 4th row and 5th column in the Sheet.
Some cases the column index may show up as -1. This means there was a problem in deriving some data. For example lets assume a category CODE was provided. The category CODE in the Excel maybe correct, but due to some issue in the system it was not able to find the category related to that code. As you can see some errors can be a result of intermediate problems and the system may prefer not to blame any particular column for it (hence -1). Intermediate derivations use Mappers so it will also mention a mapper name. The mapper name gives a clue to what information it was trying to derive. Example: parentId means, it was trying to determing the parent category of the product.
Null is a technical term for 'Not Found' or 'Does not exist'
Additionally, you should copy the Process Id and time on the screen. You will also get an email for the ETL process that you can forward to the admin/developer if you are not able to understand the reason for any particular error.
Get an error like the following when uploading an Excel Sheet or File:
ProcessId:725521; Critical error, aborting process : Problem with file, technical details - org.apache.poi.poifs.filesystem.OfficeXmlFileException: The supplied data appears to be in the Office 2007+ XML. You are calling the part of POI that deals with OLE2 Office Documents. You need to call a different part of POI to process this data (eg XSSF instead of HSSF)
Excel itself can have various file formats in different versions. Like XLS, XLSX etc. Within them there maybe variations depending on the type of Office used, some of them the software may not support yet or maybe the ETL reader is configured to work with one and not the other. Either way, to work around this, simply Save As XLS format and try again with the new file.
Get an error like the follwing when uploading an Excel Sheet or File:
Example: Duplicate entry 'Tori Grey Napkin-NR15-NAP-28A' for key 'name'
This means the record already exists and you are running in INSERT mode while this should be run in UPDATE mode. You can safely ignore this if you intend to use a Sheet for both INSERTs and Updates.
Get an error like the follwing when uploading an Excel Sheet or File:
Example: Rejected attempt to modify Tori Grey Napkin-NR15-NAP-28AXXX when it does not even exist in the system. Try to INSERT the record before you attempt to UPDATE it!;
This means the record does not exist but the ETL process was in UPDATE mode trying to update it. You can safely ignore this if you intend to use a Sheet for both INSERTs and Updates.
Invalid Lookup or Category
Some errors are a result where the error will mention the words parentId or category. You have specified an invalid category or a category code that does not exist or is not correctly mapped in the system. You need to fix your data and mantion the correct category name or code.
Rules on providing Data to the system
During the course we have tried to guide many times. For the sake of documentation here are some of the most important rules (if not all).
These rules are governed by logic and reason and not something that we wish to enforce out of whim. Because a computer system is a logical system, it needs things defined in a particular way.
The following also assumes the format of your data is already agreed with the developer and once agreed you are in a phase of providing it to the system. The nature of these rules is general. While we can entertain custom requests, those have to be communicated by the client to us early on in the design phase. We have made a very flexible system, however every system has some boundary. The following points apply excluding any pre-agreed complexities in written:
Consistency: For any field (column) data must always be consistent format and expectation. Format means the same pattern is used. For example if date is "dd/mm/yyyy" then all fields concerning date should be same format. Expectation is what we expect. For instance if we expect a DATE to be stored then DATE is what we get not TIME, not DATE-TIME but Date
Similarly, if you are using a field that can only have 2 values. Say "National or International", "Yes or No", "True or False" then only use those one of 2 values i each pair applicable. For example using "India" where the field Header says "national / international" is wrong.
Another example is Size. You cannot mix size types like in the same field have say "Large" and in another row use "23 x 45". For the system this becomes nonsensical and at best it can be treated as raw text (word) not meaningful data.
Data Vs Text: This is perhaps one of the most important. Anything thats put in can be useful in 2 ways:
It only needs to be Displayed when the record is shown in the exact same way its entered
It also needs to be searched upon or be useful in more advanced ways in the future. (Search being an example but not the only one)
Text is what you just want to put in, but dont care how its processed. You just care about showing the same thing exact when the record is pulled up.
Data is when the piece of information is useful in addition to just being displayed. Data needs to be Accurate, because its part of a relation. An inaccurate data piece can result in an error or the intended logic not working.
For example: Lets say "tomorrow" (as the example am about to give which is not a feature today). You want statistics of how many albums were sold by say the singer "Kishor Kumar". This requires the system to recognise multiple things:
Kishor Kumar is a Person of type Singer.
Albums (Artifacts) where Kishor Kumar is set as the Singer.
The above is an example where the name "Kishor Kumar" is not just a word (TEXT) but DATA since it can be related to other DATA. in very simple words TEXT cannot be related to anything in the system while DATA can!
The simple rule is, if the value in a cell requires you to explain it to a Human being, rest assure a computer is not going to get it! .. and any such value is treated as TEXT at best.
Another example is Size. If you design a field to be size and enter data as 34 x 56. If the format is consistent and you intend the first number to say be length and second to be width (for all data consistently) then the system can treat it as data and will know that 34 here represents length. Else as TEXT its a only a word worthy of display. A product can only have one pure size field.
Format: part of Consistency, format also includes things like Units. For example if a DATA column is Length (mm), then:
All fields that are not blank will be an integer
The word mm does not need to be part of the cell, as its understood all are mm. If all are not mm anyway that is breaking consistency.
If there is a need to mix data then that needs to be discussed or treated as TEXT.
Blanks: Many times some data maybe missing. You can just leave it blank. Do not write words like NA, Not there etc.
Singular Vs Plural: If the field is to accept multiple values .. please please , ensure the field name either ends with "s" or "(s)" this will tell us its expected to be plural. And then in this case if you supply a comma spearater list, then we know we have to break it apart. For this it is also important that the delimiter (,) is not part of the word.
However if the column is intended for a Single value at a time only; and randomly you throw in a comma separated list; it will be considered as a single word.
A good practice would be to say put an (s) with the column header so its clear to us that you intend this to be a comma separated plural list. Example director(s), producers(s) etc.
Date & Time Formats: Date is a complex field. As a date can have various formats. In addition to the format you also need to decide the field type. Whateve the format the main common rule for all types is that all cells should be the same field format. Either Text, Number, or Date.
The image below shows how a mixed date format is defined using Date type, which is incorrect. Consult list after diagram for recommendations:
Recommended types for various formats (left is most preferred):
Full date format like dd/mm/yyyy: Date, Text
Year: Number (Integer), Text
Mixed: Text Only
Month Year like mm/yyyy: Date, Text
Month formats omit date, however by default we store it as Date. Means silently a day (01) is added by default. We hide it. But if your data contains mixed format, the output or export will again always omit the day. Because the format specified month and year, thats what you will get even if you provide it ocassionally in the input.
Time: Text, Date (Time format)
...Please again note, if you select one format it should be consistent for all cells of that type
Accuracy: The system can only check format issues; it cannot determine if the data itself being entered is correct or not. So for example, if you have a field "Author" and in author you decide to write "Arjun Dhar & XYZ" , the system will recognise it as one word; it cannot make sense of anything in it. That is what you put in is what you get. The system cannot break it for you based on an unspecified context thats only in your mind. ... because it has no way of knowing what you intend. It can only trust what it sees in a cell value as a value.
One-to-One Mapping: A field or column can only map to one type of data. For example if you say mix producer(s) and director(s) in one column or field. Then its humanly and logically impossible to distinguish which name is intended for which exact field. So either then the data is mapped to one or both. But it cannot on its own map to the correct one, as it has no way of knowing. It is best to avoid mixing data types as you can get into serious trouble if one has to distinguish between the two later.
A another finer example is, assume you have producer(s) who are people but you also have production houses (which are organization); then it is recommended that they be separate columns as the nature of the data is itself different. Unless the system already has context based on name (like a pre written authorities file); it will not know which is a person and which is an organization.
Category data: Some DATA like Category name or code, Reference Code, Sku Code, Product Code need to be Accurate and unique. This means lets say you have a sub-category called Shirts. And a Category for Men and one for Women. Both have the sub-category shirts. If in your format; you choose to only use the category name then the names obviously have to be different as the system will not know if the shirts category you are referring to belongs to men or women. The following are possible solutions (@ format design stage):
Simply keep each Category or Sub-Category name unique so there is no chance of collision.
Maintain multiple columns for category, describing the entire hierarchy. So then a sheet would not just contain the sub-category name but in this example 2 columns. Parent Category, Sub Category.
Dont use names at all, and use product-code instead.
Please note, any of the three methods have to be pre-decided before you fill data and the method has to be Consistent across all categories and sheets of data.
Prevent duplicates & De-Duplication: If data is aggregated over time and from multiple sources, its possible data can be duplicated. If you wish to avoid duplicates or avoid noise the following are possible options:
Data not Text: The field in question then must follow rules of it being data and not text. the difference has been explained above.
Desgined to reject duplicates: Tell us and we can prevent the system from accepting duplicate values for anything you wish. Again this needs to be before you start entering any data and not an after thought.
Specify some custom algorithm that can identify and remove duplicates ones they exist: Here you have to provide us the logic or work with us to identify how to remove them rather than merely preventing them. This is a more complex and effort based activity.
Abbreviations: Abbreviations are hard, and without any additional context they are treated as words. Abbreviations are harder to search and relate to also, and one should try to avoid abbreviations as much as possible. The system can only cope to an extent with abbreviated words.
One Sheet per file: When dealing with excel data, please maintain one sheet per file. There are multiple reasons for this, one being that it keeps the file size minimum per upload for what is required to be uploaded only.
Dont mix file types: The system expects one file to have one format.
Sheet Format & meta-data: The system does not read comments , colors etc. It only reads Column and row cell values. Do not merge columns are cells either. You can put comments and colors in files, however they are ignored by the system.
Column header names, comments, annotations, colors, fonts or any formatting @ text level are all ignored while importing data. While extra information is always valuable as it explains the data well. The point of caution and reminder here is that any formatting, comments etc should be put on the Header as that is all we will humanly evaulate. It is not possible to read every row or cell or field of data hence any formatting other than the header will be missed by human and machine both. If its for your own notes, thats fine but we will not be able to process it.
Frequently Asked Questions (FAQ)
What happens if I shut my window while the ETL process is running?
Nothing happens to the process. Your window is just a monitor. The process runs on the server and once started is unaffected by anything you do on the admin panel, so you dont need to worry.
Can I cancel an ETL process?
How is the Process able to identify between existing and new records?
Every record has some unique way to identify the record. It chould be the SKU or Product Code, or a combination of both, or the file name in case of an image etc.
The system is configured to identify the record by its primary id's and then on UPDATE it simply updates the fields that are not configured to be primary.
During UPDATE what if I have changes any of the primary id fields?
The system will consider it to be a new record, and complain that you are trying to update a records that does not even exist and hence you should insert it first.
I pressed PROCESS but nothing happened
The click may not have been proper. Click it again. The progress bar will start showing a loading icon the moment the process actually starts. If you do not see any loading icon means the process never started.
Should I wait infront of my computer, if a process takes long?
No, as mentioned the ETL process does not run on your machine. It runs on the server. If you have the process ID you can even shut the window and wait for the email. If you get no email or information, you can check with the super admin.
If the Excel has some columns present and some not, what will happen?Excel is partially updated column wise
It depends. Some columns are mandatory while some are not. Again what is mandatory depends on INSERT or UPDATE. UPDATE is more lenient. The columns that are mandatory are different for each project. You can check with the developer or admin.
Is there a way of bulk updating details of products without the master sheet?
One can use the admin panel to do manual updates, but thats an inefficient process for multiple products. There is nothing really recognized as a "Master List". However if you are not regularly maintaining your excel, you can always export products from the Catalog Listing. The catalog export, available in Catalog section, is designed to export in the format the ETL import works. So one can simply export, modify and re-upload via ETL.
Would it work if we entered the SKU code or the style number and just the categories that need to be updated in an excel sheet and upload that?
By default you have one ETL upload format in which all the necessary fields should be provided for update. However we can create alternative formats if for any reason you want to reduce columns or reduce the amount of data available for someone uploading a particular sheet. An example could be a simple 2 column sheet with SKU code and Inventory position and nothing else. That is possible but is not part of the standard set of Readers as such requirements are project based. Though if needed its not difficult to accomplish.
For import what file formats are supported?
Excel (XLS, XSLS), CSV (Comma Separated Values). Though by default applications are configured and tested to work straight with Excel (XLS).